Rapid progress in both AI and genomics is driving advances in medicine. In this study, we apply machine learning and AI methods to gene expression data obtained from cancer cells. We aim to understand how these cells behave under different micro-environmental conditions, specifically by predicting the oxygen condition of each cell: hypoxia (low oxygen) or normoxia (normal oxygen). To achieve this, we construct a model using single-cell RNA sequencing data.
The data we analyzed come from 4 experiments in which two different cancer cell lines, MCF7 and HCC1806, were studied. Each was sequenced with two different RNA sequencing technologies: SMARTSeq and DropSeq.
import sys
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt #visualisation
import seaborn as sns #visualisation
%matplotlib inline
sns.set(color_codes=True)
from random import randint
from scipy.stats import kurtosis, skew
import random
random.seed(111)
np.random.seed(111)
########## Unsupervised learning libraries ##########
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans
############################################################
########## Supervised learning libraries ##########
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
############################################################
mcf7_smarts_metadata = pd.read_csv("SmartSeq/MCF7_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_smarts_metadata))
print("First column: ", mcf7_smarts_metadata.iloc[ : , 0])
Dataframe dimensions: (383, 8)
First column: Filename
output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam MCF7
output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam MCF7
...
output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam MCF7
output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam MCF7
Name: Cell Line, Length: 383, dtype: object
mcf7_smarts_metadata
| Filename | Cell Line | Lane | Pos | Condition | Hours | Cell name | PreprocessingTag | ProcessingComments |
|---|---|---|---|---|---|---|---|---|
| output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A10 | Hypo | 72 | S28 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A11 | Hypo | 72 | S29 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A12 | Hypo | 72 | S30 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A1 | Norm | 72 | S1 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.1 | A2 | Norm | 72 | S2 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.4 | H5 | Norm | 72 | S359 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.4 | H6 | Norm | 72 | S360 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.4 | H7 | Hypo | 72 | S379 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.4 | H8 | Hypo | 72 | S380 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam | MCF7 | output.STAR.4 | H9 | Hypo | 72 | S381 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
383 rows × 8 columns
mcf7_smarts_metadata.shape
(383, 8)
mcf7_smarts_metadata['Cell name'].nunique()
383
The indices of the dataframe mcf7_smarts_metadata are the filenames of the aligned BAM files, one per sequenced cell. The same dataframe contains 8 columns: Cell Line, Lane, Pos, Condition, Hours, Cell name, PreprocessingTag and ProcessingComments.
We can see that each filename combines the information contained in some of these columns. For example, the first row has the filename output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam, which refers to cell S28 at position A10, grown under hypoxia (Hypo), whose reads were aligned and sorted by coordinate.
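The filename pattern can be parsed back into these metadata fields. A minimal sketch (the helper `parse_bam_filename` is ours, not part of the original pipeline, and assumes the underscore-separated layout described above):

```python
def parse_bam_filename(filename):
    """Split a STAR-aligned BAM filename into its metadata fields."""
    # e.g. "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam"
    lane, pos, condition, cell, tag = filename.split("_")
    return {"Lane": lane, "Pos": pos, "Condition": condition,
            "Cell name": cell, "PreprocessingTag": tag}

fields = parse_bam_filename("output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam")
print(fields["Condition"])  # Hypo
```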
mcf7_smarts_unfiltered = pd.read_csv("SmartSeq/MCF7_SmartS_Unfiltered_Data.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_smarts_unfiltered))
print("First column: ", mcf7_smarts_unfiltered.iloc[ : , 0])
Dataframe dimensions: (22934, 383)
First column: "WASH7P" 0
"MIR6859-1" 0
"WASH9P" 1
"OR4F29" 0
"MTND1P23" 0
...
"MT-TE" 4
"MT-CYB" 270
"MT-TT" 0
"MT-TP" 5
"MAFIP" 8
Name: "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam", Length: 22934, dtype: int64
mcf7_smarts_unfiltered
| | "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "WASH7P" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| "MIR6859-1" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "WASH9P" | 1 | 0 | 0 | 0 | 0 | 1 | 10 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 4 | 5 |
| "OR4F29" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| "MTND1P23" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| "MT-TE" | 4 | 0 | 0 | 0 | 3 | 3 | 0 | 0 | 14 | 1 | ... | 0 | 4 | 12 | 4 | 0 | 1 | 6 | 0 | 7 | 4 |
| "MT-CYB" | 270 | 1 | 76 | 66 | 727 | 2717 | 9326 | 3253 | 7949 | 30 | ... | 239 | 3795 | 12761 | 2263 | 1368 | 570 | 3477 | 349 | 2184 | 1149 |
| "MT-TT" | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 4 | 0 | ... | 0 | 7 | 4 | 2 | 0 | 0 | 3 | 0 | 2 | 2 |
| "MT-TP" | 5 | 0 | 0 | 1 | 0 | 1 | 1 | 4 | 2 | 0 | ... | 0 | 14 | 56 | 11 | 2 | 0 | 7 | 2 | 28 | 11 |
| "MAFIP" | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 1 | 0 | 6 | 0 | 1 | 4 |
22934 rows × 383 columns
Each column of the dataframe mcf7_smarts_unfiltered (383 columns) corresponds to a row of the dataframe mcf7_smarts_metadata (383 rows). Thus, for every cell (file) our dataset records the read count of each gene, i.e. how strongly each gene is expressed.
The indices of this unfiltered dataframe are gene names (WASH7P, MT-TT, etc.), identifiers known as gene symbols. Gene symbols are just acronyms and may not be unique; later on we will analyze the correlation between the rows (gene expression profiles) to check whether the same gene appears under different acronyms.
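A quick first check for repeated symbols, before any correlation analysis, is pandas' duplicated-index test. A sketch on a made-up toy index (the gene names here are illustrative, not taken from the real duplicates):

```python
import pandas as pd

# Toy index with one repeated symbol, standing in for the real gene list.
genes = pd.Index(["WASH7P", "MT-TT", "WASH7P", "MAFIP"])
dup_mask = genes.duplicated(keep=False)  # True for every occurrence of a repeated label
print(genes[dup_mask].tolist())  # ['WASH7P', 'WASH7P']
```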
The unfiltered dataframe contains only numeric information:
set(list(mcf7_smarts_unfiltered.dtypes))
{dtype('int64')}
Do we have any missing data? We could inspect the null values column by column, but since there are 383 columns, we simply take the total sum of missing values over the whole dataframe:
mcf7_smarts_unfiltered.isnull().sum().sum()
0
There are no missing values in our dataframe, therefore there is no need for imputation.
We can look at the descriptive statistics of our dataframe. We analyze the distributions of single cells as in the example report:
mcf7_smarts_unfiltered.describe(percentiles=[.05, .25, .5, .75, .95])
| | "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | ... | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 |
| mean | 40.817651 | 0.012253 | 86.442400 | 1.024636 | 14.531351 | 56.213613 | 75.397183 | 62.767725 | 67.396747 | 2.240734 | ... | 17.362562 | 42.080230 | 34.692422 | 32.735284 | 21.992718 | 17.439391 | 49.242784 | 61.545609 | 68.289352 | 62.851400 |
| std | 465.709940 | 0.207726 | 1036.572689 | 6.097362 | 123.800530 | 503.599145 | 430.471519 | 520.167576 | 459.689019 | 25.449630 | ... | 193.153757 | 256.775704 | 679.960908 | 300.291051 | 153.441647 | 198.179666 | 359.337479 | 540.847355 | 636.892085 | 785.670341 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 17.000000 | 0.000000 | 5.000000 | 0.000000 | 7.000000 | 23.000000 | 39.000000 | 35.000000 | 38.000000 | 1.000000 | ... | 9.000000 | 30.000000 | 0.000000 | 17.000000 | 12.000000 | 9.000000 | 27.000000 | 30.000000 | 38.000000 | 33.000000 |
| 95% | 149.000000 | 0.000000 | 305.350000 | 5.000000 | 63.000000 | 242.000000 | 340.000000 | 272.000000 | 294.700000 | 9.000000 | ... | 63.000000 | 176.000000 | 52.000000 | 137.000000 | 95.000000 | 68.000000 | 202.000000 | 221.000000 | 255.350000 | 211.000000 |
| max | 46744.000000 | 14.000000 | 82047.000000 | 289.000000 | 10582.000000 | 46856.000000 | 29534.000000 | 50972.000000 | 36236.000000 | 1707.000000 | ... | 17800.000000 | 23355.000000 | 81952.000000 | 29540.000000 | 12149.000000 | 19285.000000 | 28021.000000 | 40708.000000 | 46261.000000 | 68790.000000 |
10 rows × 383 columns
A quick look at the distributions of the single cells shows that many of them are highly right-skewed, being dominated by 0 values. They are not standardized: they do not have zero mean and unit variance, and their standard deviation is very large compared to the mean.
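For reference, the standardization mentioned here (zero mean, unit variance per variable) amounts to a column-wise z-score. A minimal numpy sketch on a made-up matrix (the values in `X` are purely illustrative):

```python
import numpy as np

# Toy matrix: 3 observations x 2 variables, standing in for the real data.
X = np.array([[0.0, 5.0],
              [0.0, 15.0],
              [8.0, 10.0]])

# Column-wise z-score: subtract each column's mean, divide by its std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After this transformation every column has mean 0 and standard deviation 1, which is what StandardScaler (imported above) computes internally.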
Let's choose 10 random variables to visualize their non-normal distributions:
random_variable_indices = [randint(0, 382) for i in range(0, 10)]  # randint is inclusive on both ends: valid column indices are 0-382
print(random_variable_indices)
for i in random_variable_indices:
    sns.displot(
        mcf7_smarts_unfiltered,
        x=mcf7_smarts_unfiltered.columns.tolist()[i],
        kind="kde"
    )
[108, 161, 252, 99, 203, 213, 315, 86, 322, 99]
The plots confirm what we wrote about the distribution characteristics: we observe an elongated right tail.
We continue the exploratory data analysis by investigating outliers, using the interquartile range (IQR) rule: any value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is considered an outlier and the corresponding row is dropped:
Q1 = mcf7_smarts_unfiltered.quantile(0.25)
Q3 = mcf7_smarts_unfiltered.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
"output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" 17.0
"output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" 0.0
"output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" 5.0
"output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" 0.0
"output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" 7.0
...
"output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" 9.0
"output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" 27.0
"output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" 30.0
"output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" 38.0
"output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" 33.0
Length: 383, dtype: float64
mcf7_smarts_unfiltered_noOut = mcf7_smarts_unfiltered[~((mcf7_smarts_unfiltered < (Q1 - 1.5 * IQR)) |(mcf7_smarts_unfiltered > (Q3 + 1.5 * IQR))).any(axis=1)]
print(mcf7_smarts_unfiltered_noOut.shape)
(6435, 383)
mcf7_smarts_unfiltered.shape
(22934, 383)
100*(22934-6435)/22934
71.94122263887678
Removing outliers with the interquartile range rule would discard 72% of our dataset, which is not a desirable outcome. As we observed above, many observations are dominated by 0s.
We can quantify sparsity this way: if more than a threshold fraction X of the gene expression values of an observation are 0, we consider that observation to be highly selective of some specific genes, and hence sparse.
If a dataset consists mostly of sparse observations, we say it has a sparse structure.
def variable_sparsity(variable_series, threshold):
    # Flag a series as sparse if at least `threshold` fraction of its values are 0.
    if len(variable_series[variable_series == 0]) / len(variable_series) >= threshold:
        return 1  # sparse
    else:
        return 0  # not sparse
df_info_sparsity_th95 = (
pd.DataFrame(mcf7_smarts_unfiltered.apply(lambda x: variable_sparsity(x, 0.95), axis=0))
.reset_index()
.rename(columns={'index':'cell', 0:'flag_sparsity'})
)
df_info_sparsity_th50 = (
pd.DataFrame(mcf7_smarts_unfiltered.apply(lambda x: variable_sparsity(x, 0.50), axis=0))
.reset_index()
.rename(columns={'index':'cell', 0:'flag_sparsity'})
)
df_info_sparsity_th95
| | cell | flag_sparsity |
|---|---|---|
| 0 | "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCo... | 0 |
| 1 | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCo... | 1 |
| 2 | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCo... | 0 |
| 3 | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoor... | 0 |
| 4 | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoor... | 0 |
| ... | ... | ... |
| 378 | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCo... | 0 |
| 379 | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCo... | 0 |
| 380 | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCo... | 0 |
| 381 | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCo... | 0 |
| 382 | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCo... | 0 |
383 rows × 2 columns
print(len(df_info_sparsity_th95[df_info_sparsity_th95.flag_sparsity ==1]))
print(len(df_info_sparsity_th50[df_info_sparsity_th50.flag_sparsity ==1]))
16 348
100*(348/383)
90.86161879895562
We defined two thresholds for sparsity: 95% and 50%.
We see that 91% of the single cells have more than half of their gene expression values equal to 0, which means we do not have densely expressed cells.
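If we later decide to exclude the 16 cells flagged at the 95% threshold, the filtering step can be sketched as below. The frames `flags` and `counts` are made-up stand-ins for df_info_sparsity_th95 and mcf7_smarts_unfiltered; whether to actually drop those cells is a modelling choice:

```python
import pandas as pd

# Toy stand-ins: 4 cells, 2 of them flagged as sparse.
flags = pd.DataFrame({
    "cell": ["c1", "c2", "c3", "c4"],
    "flag_sparsity": [0, 1, 0, 1],
})
counts = pd.DataFrame(0, index=["g1", "g2"], columns=["c1", "c2", "c3", "c4"])

# Keep only the cells not flagged as sparse.
dense_cells = flags.loc[flags.flag_sparsity == 0, "cell"].tolist()
dense_counts = counts[dense_cells]
print(dense_counts.shape)  # (2, 2)
```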
We can run the same analysis on the sparsity of the features (genes) themselves:
df_info_sparsity_th95_feature = (
pd.DataFrame(mcf7_smarts_unfiltered.T.apply(lambda x: variable_sparsity(x, 0.95), axis=0))
.reset_index()
.rename(columns={'index':'variable', 0:'flag_sparsity'})
)
df_info_sparsity_th50_feature = (
pd.DataFrame(mcf7_smarts_unfiltered.T.apply(lambda x: variable_sparsity(x, 0.50), axis=0))
.reset_index()
.rename(columns={'index':'variable', 0:'flag_sparsity'})
)
print(len(df_info_sparsity_th95_feature[df_info_sparsity_th95_feature.flag_sparsity == 1])/len(mcf7_smarts_unfiltered.T))  # NB: the denominator is the number of cells (383), not the number of genes
print(len(df_info_sparsity_th50_feature[df_info_sparsity_th50_feature.flag_sparsity == 1])/len(mcf7_smarts_unfiltered.T))
17.963446475195823 35.01827676240209
Note that the denominator above, len(mcf7_smarts_unfiltered.T), is the number of cells (383), not the number of genes, so the printed values are sparse-gene counts divided by 383 rather than fractions: they correspond to 6,880 and 13,412 sparse genes. Relative to the 22,934 genes, about 30% of the genes are expressed in fewer than 5% of the single cells, and about 58% are not expressed in at least half of them.
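The per-gene zero fraction can also be computed in one vectorized pass rather than with apply. A sketch on a made-up toy matrix (the 0.95 cutoff mirrors the threshold used above):

```python
import pandas as pd

# Toy counts: genes as rows, cells as columns (stand-in for the real matrix).
counts = pd.DataFrame(
    {"c1": [0, 3, 0], "c2": [0, 1, 0], "c3": [0, 2, 5], "c4": [0, 0, 0]},
    index=["g1", "g2", "g3"],
)

zero_frac = (counts == 0).mean(axis=1)          # fraction of cells in which each gene is 0
sparse_genes = zero_frac[zero_frac >= 0.95].index
print(sparse_genes.tolist())  # ['g1'] -- zero in all 4 cells
```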
As we noted earlier, looking at the descriptive statistics and some density plots, the variables are highly centered around zero. Let's quantify the skewness and kurtosis:
colN = mcf7_smarts_unfiltered.shape[1]
colN
list_skew_cells = []
for i in range(colN):
    v_df = mcf7_smarts_unfiltered[mcf7_smarts_unfiltered.columns.tolist()[i]]
    list_skew_cells += [skew(v_df)]
list_skew_cells
sns.histplot(list_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - original df')
Text(0.5, 0, 'Skewness of single cells expression profiles - original df')
list_skew_cells
[65.37006963464164, 38.757057757830026, 48.17071111386066, 25.526517911830066, 61.845102435720776, 67.08249677429846, 36.612008466309156, 71.07021870973485, 46.99681344066586, 50.732659085234005, 62.05606742798137, 48.506412433375324, 43.980148001172424, 44.8691309157291, 45.240902627133956, 78.66242273596696, 59.01380313665439, 21.928874819985605, 73.8210177139657, 59.04932828249711, 66.22275900553667, 58.96593667640693, 44.96986532634896, 103.86731864915672, 57.89553479345821, 52.64980700309224, 80.00807665860066, 29.81450276935768, 75.88569276983516, 62.77073674327232, 58.967314127982874, 56.35768787121093, 57.05824203467975, 57.40692781832745, 35.865560499764676, 69.70304704232541, 51.54771295843748, 50.4302831519899, 38.05623231866659, 63.84215005551229, 55.871098661374255, 50.13985775387157, 65.50723403031346, 35.10247604200467, 39.35587400774901, 57.63201263907013, 82.75324463209218, 61.22967602896186, 52.80763510345485, 68.07217898419657, 38.55514525527467, 28.60624381707881, 57.04071732117267, 63.68131596313151, 79.63103872057852, 69.43867475530776, 53.13234968191677, 49.192150725265705, 47.112300263013076, 62.81197840747485, 19.3702591267805, 62.78719522052137, 53.00163265324349, 68.38942424936073, 58.84860621258022, 49.859036595788865, 59.91118521737059, 58.29296442147926, 28.501485077725295, 56.289739960677416, 56.10921017282855, 80.81135774324086, 52.495958672786834, 47.58980893389708, 41.96611029532434, 48.44515799334938, 66.49582265763209, 23.444091632988297, 52.6288364648812, 27.01337262028634, 88.77921502160804, 57.77254998825772, 51.71053374010335, 50.02425317120022, 58.948596486795985, 57.66566722736542, 86.92872751039371, 151.42985189058763, 63.47323980421276, 67.88538067693766, 51.26198644163647, 32.24058786142207, 69.40202095211392, 46.81290939092324, 49.6607568393191, 61.33878776986664, 56.58642800014645, 47.75228994353625, 41.16425467210575, 70.65867926429611, 48.48662025962603, 64.85100145588382, 64.22086905409589, 61.22079856666489, 
71.59721945334587, 41.000702954119866, 37.113795379790865, 65.3756961569521, 39.52786094768862, 48.48692428636458, 44.90091478104251, 57.08882854216362, 45.46169879104459, 77.45978738263878, 63.80812249178675, 34.38382743402901, 55.41749408432321, 48.36925753145954, 92.29558714438865, 69.3272535990563, 57.852382900121256, 56.002106957864434, 40.950275407057674, 67.11130036308852, 58.1790979163914, 75.04440602127937, 75.55781400017814, 67.52539384663852, 62.22875559391499, 55.245801103305816, 52.11997569576362, 65.48046731794409, 63.284230784177375, 65.97009851183371, 43.0971694633296, 62.79522825908089, 49.913435939904936, 81.99209611013521, 78.8205116462107, 55.0366835690786, 73.11885173108082, 55.27448836616678, 34.20405563505985, 55.315948086886095, 48.949031881255685, 43.636576566402475, 49.772452537702556, 86.87630310795878, 50.23688476394104, 45.309096880336405, 71.165159853489, 49.17996199388713, 33.70766383906172, 57.429479798165424, 45.554591660431726, 58.57925862240614, 54.96155619969472, 41.66781943793925, 49.928759940565214, 63.228277214449676, 52.067252516491344, 37.75186366026408, 27.55032414008304, 62.076915640950254, 55.8853155118677, 53.07771799267121, 50.456446956447074, 46.02587172930201, 45.6534887932207, 47.6029620912295, 48.79202221125533, 70.29184648933408, 65.83827094383773, 42.02355427642308, 60.723574231084704, 70.90393849474093, 67.73687426971699, 53.07907840285801, 62.583609715782, 35.25752813740689, 49.51834336616349, 71.62583311555292, 53.560075746016565, 44.02627157401278, 52.169897375742316, 47.99875691571871, 71.42270823353229, 38.79044887444134, 52.549923213549754, 102.00553909241644, 74.67616191725513, 41.587093801025006, 41.07405111741416, 69.24061512142902, 55.51674558282735, 47.287725753277044, 50.62779299327853, 67.09100520476996, 63.02465200124579, 55.809551062345115, 45.210818972300835, 53.9019816512615, 48.01457632487985, 58.92394704055915, 44.434469327216945, 46.69664550227368, 54.454665220207076, 65.03016271921076, 
50.840358171001455, 53.33744522132019, 42.46705736466091, 76.33286045926278, 30.531568425529667, 52.1050761714593, 46.60826350047661, 66.16731163140442, 90.06741029302918, 51.83310117931454, 54.061554552285614, 80.42087059750779, 56.222173310389245, 33.189739176431175, 83.07320745361069, 40.270591561112326, 49.63738946086104, 53.84720177918596, 66.99937800855412, 51.609322214281676, 36.65435460378276, 59.74899650860245, 59.85719072791244, 38.940393550720785, 37.79837500859966, 25.604759215404652, 72.39966662785889, 58.06978386246665, 63.19590337451561, 49.800119478455194, 50.061938422154654, 58.88894716023596, 77.28716957436322, 82.27148189244458, 56.553363197634475, 47.729399617201175, 69.81359415881892, 113.20425129901093, 65.18150861698093, 38.815290573255304, 55.63423703442932, 33.609242423947734, 42.72302362269744, 49.173189755528064, 43.493285883258245, 39.65245251165305, 43.36192361199769, 49.483543781273355, 42.107892306470745, 51.48034437539236, 63.180866952454224, 71.95750206402795, 72.00104408794913, 47.91424362443015, 68.9353900895262, 58.23068258622872, 72.25297578166035, 51.51956751296059, 44.42163641959091, 37.33275502942067, 53.401988247033124, 128.3647319852565, 40.59645956146311, 57.34418120355423, 61.21313072460142, 50.737798613088025, 40.83751696848545, 44.932773567934284, 54.941524288953005, 50.805866691556915, 52.096110749133835, 71.85890874890647, 35.994762722909094, 44.58104259411048, 68.56784405645614, 51.46137421898208, 50.57147792922357, 42.800858225291876, 80.54261974846894, 40.056339151737824, 51.334874617361166, 66.07230735422311, 30.256935419781588, 47.00066314495175, 63.31245336946145, 52.442291472872995, 45.2429966807645, 9.836727639351762, 44.787809892896504, 58.58793766498599, 47.9916696942846, 66.67199714358286, 45.24064888864398, 57.965051174038365, 52.17258827703882, 52.5113657289115, 34.87846271571585, 60.314009208597, 50.720264643709235, 35.73602045910699, 38.83912309254386, 41.621661030507035, 65.05669128764822, 
66.3044797746376, 41.44768409179196, 73.08046360294188, 51.66666303261338, 49.51296417251522, 54.28789488993307, 53.25292548622611, 61.48266283476552, 57.201273454519054, 59.02120697299761, 48.44275386255713, 40.369886011944, 48.8898867260156, 79.4919394353658, 48.29992043587732, 47.64484729222149, 68.82717178105322, 60.051239695517395, 61.77232823280864, 80.5955800660854, 67.87063807207537, 65.23037130601409, 42.67733932500217, 58.255003067772634, 43.94290984871098, 50.73989045979915, 45.881591503799406, 58.937397893863796, 42.13103379751656, 76.68417975207576, 39.854961087533724, 77.16580657307922, 49.92084058825978, 43.6392170674971, 58.519161764867725, 46.30193960834268, 49.97914243231349, 61.02338455788791, 70.82607923098708, 47.03568674095103, 76.90641659959971, 53.75629432980673, 46.767513936340904, 76.3884160052077, 60.25248714743098, 64.08040256481473, 41.5877999374769, 41.61003158779742, 68.1574084189103, 68.04752931963532, 51.365908231016036, 54.12555928847672, 28.82670255055156, 48.771281619284586, 58.87231460915243, 50.80823842400135, 72.60835102225886, 53.960143878801304, 57.25166850896428, 47.732909145755336, 49.36553372642859, 54.24134487755268, 55.885586761007296, 54.48383061299693, 84.44221560429415, 73.88334824443653, 48.581774224165926, 74.40484049244661, 45.49129630227665, 42.08338649692241, 47.99358737778641, 56.033824762270264]
list_kurt_cells = []
for i in range(colN):
    v_df_kurt = mcf7_smarts_unfiltered[mcf7_smarts_unfiltered.columns.tolist()[i]]
    list_kurt_cells += [kurtosis(v_df_kurt)]
list_kurt_cells
sns.histplot(list_kurt_cells,bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - original df')
Text(0.5, 0, 'Kurtosis of single cells expression profiles - original df')
As suggested by the histograms, the data is far from normally distributed, with many samples having high skewness and kurtosis values.
This is problematic: data that deviates this strongly from normality can violate the assumptions of some machine learning models, or simply make it hard for an algorithm to distinguish among the non-zero values.
One way to make the distribution less skewed is to apply a log transformation:
var21_log2 = np.log2(mcf7_smarts_unfiltered[mcf7_smarts_unfiltered.columns.tolist()[20]]+1)
sns.boxplot(x=var21_log2)
<AxesSubplot:xlabel='"output.STAR.1_B6_Norm_S54_Aligned.sortedByCoord.out.bam"'>
The same sample before log transformation is heavily centered around 0:
sns.boxplot(x=mcf7_smarts_unfiltered[mcf7_smarts_unfiltered.columns.tolist()[20]]+1)
<AxesSubplot:xlabel='"output.STAR.1_B6_Norm_S54_Aligned.sortedByCoord.out.bam"'>
var21_log2.describe().round(2)
count 22934.00 mean 2.49 std 2.98 min 0.00 25% 0.00 50% 0.00 75% 5.04 max 15.30 Name: "output.STAR.1_B6_Norm_S54_Aligned.sortedByCoord.out.bam", dtype: float64
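The effect of log2(x+1) on skewness can be verified on a small synthetic zero-inflated count vector (the values are made up to mimic a single-cell expression profile, not taken from the data):

```python
import numpy as np
from scipy.stats import skew

# Mostly zeros with a few very large counts, like a single-cell profile.
counts = np.array([0] * 90 + [1, 2, 5, 10, 50, 100, 500, 1000, 5000, 20000],
                  dtype=float)

raw_skew = skew(counts)
log_skew = skew(np.log2(counts + 1))  # same transform used in the notebook

# The log transform strongly reduces (but does not eliminate) the right skew.
print(round(raw_skew, 2), round(log_skew, 2))
```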
Now let's take only the first 50 columns/cells (for speed) and recompute skewness and kurtosis after applying the log transformation to them:
df_mcf7_50vars_log2 = (mcf7_smarts_unfiltered.iloc[:,:50]+1).apply(np.log2)
print(df_mcf7_50vars_log2.shape)
(22934, 50)
df_mcf7_50vars_log2
| | "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.1_D2_Norm_S146_Aligned.sortedByCoord.out.bam" | "output.STAR.1_D3_Norm_S147_Aligned.sortedByCoord.out.bam" | "output.STAR.1_D4_Norm_S148_Aligned.sortedByCoord.out.bam" | "output.STAR.1_D5_Norm_S149_Aligned.sortedByCoord.out.bam" | "output.STAR.1_D6_Norm_S150_Aligned.sortedByCoord.out.bam" | "output.STAR.1_D7_Hypo_S169_Aligned.sortedByCoord.out.bam" | "output.STAR.1_D8_Hypo_S170_Aligned.sortedByCoord.out.bam" | "output.STAR.1_D9_Hypo_S171_Aligned.sortedByCoord.out.bam" | "output.STAR.1_E10_Hypo_S220_Aligned.sortedByCoord.out.bam" | "output.STAR.1_E11_Hypo_S221_Aligned.sortedByCoord.out.bam" |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "WASH7P" | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
| "MIR6859-1" | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| "WASH9P" | 1.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.00000 | 3.459432 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| "OR4F29" | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
| "MTND1P23" | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| "MT-TE" | 2.321928 | 0.0 | 0.000000 | 0.000000 | 2.000000 | 2.00000 | 0.000000 | 0.000000 | 3.906891 | 1.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 1.584963 | 0.000000 | 1.584963 | 0.0 | 0.0 | 0.0 | 2.807355 |
| "MT-CYB" | 8.082149 | 1.0 | 6.266787 | 6.066089 | 9.507795 | 11.40833 | 13.187197 | 11.667999 | 12.956739 | 4.954196 | ... | 8.108524 | 7.209453 | 10.368506 | 11.498849 | 4.584963 | 7.622052 | 0.0 | 5.0 | 0.0 | 7.483816 |
| "MT-TT" | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.00000 | 1.000000 | 1.000000 | 2.321928 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.000000 |
| "MT-TP" | 2.584963 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 1.00000 | 1.000000 | 2.321928 | 1.584963 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 1.584963 |
| "MAFIP" | 3.169925 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.584963 | 0.0 | 0.0 | 0.0 | 1.584963 |
22934 rows × 50 columns
np.shape(df_mcf7_50vars_log2)
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=df_mcf7_50vars_log2,palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
df_mcf7_allVars_log2 = (mcf7_smarts_unfiltered.iloc[:,:]+1).apply(np.log2) # log transformation of all variables
# skewness of each cell's (column's) expression profile
df1_log2_skew_cells = [skew(df_mcf7_allVars_log2[col]) for col in df_mcf7_allVars_log2.columns]
df1_log2_skew_cells
sns.histplot(df1_log2_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - log2 df')
Text(0.5, 0, 'Skewness of single cells expression profiles - log2 df')
Most of the variables now have a skewness score around 0, as expected after applying the log transformation.
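The effect of the log2(x + 1) transform on skewness can be reproduced on synthetic count-like data (a hypothetical sketch with simulated values, not the project data):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Exponential draws mimic the heavy right tail of raw expression counts
raw = rng.exponential(scale=100, size=10_000)
logged = np.log2(raw + 1)  # same log2(x + 1) transform as above

# The raw data is strongly right-skewed; the transform pulls skewness toward 0
print(skew(raw), skew(logged))
```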
# kurtosis of each cell's (column's) expression profile
df1_log2_kurt_cells = [kurtosis(df_mcf7_allVars_log2[col]) for col in df_mcf7_allVars_log2.columns]
df1_log2_kurt_cells
sns.histplot(df1_log2_kurt_cells,bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - log2 df')
Text(0.5, 0, 'Kurtosis of single cells expression profiles - log2 df')
len(df1_log2_kurt_cells)
383
for i in random_variable_indices:
sns.displot(
df_mcf7_allVars_log2,
x= df_mcf7_allVars_log2.columns.tolist()[i],
kind="kde"
)
Comparing these density plots with the pre-transformation ones above, the distributions have clearly changed: they have become more bimodal.
df_mcf7_allVars_log2_small = df_mcf7_allVars_log2.iloc[:, 10:30] # select only part of the samples to keep the run time short
sns.displot(data=df_mcf7_allVars_log2_small,palette="Set3",kind="kde", bw_adjust=2)
df_mcf7_allVars_log2_small.describe()
| "output.STAR.1_A8_Hypo_S26_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A9_Hypo_S27_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B10_Hypo_S76_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B11_Hypo_S77_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B12_Hypo_S78_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B1_Norm_S49_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B2_Norm_S50_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B3_Norm_S51_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B4_Norm_S52_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B5_Norm_S53_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B6_Norm_S54_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B7_Hypo_S73_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B8_Hypo_S74_Aligned.sortedByCoord.out.bam" | "output.STAR.1_B9_Hypo_S75_Aligned.sortedByCoord.out.bam" | "output.STAR.1_C10_Hypo_S124_Aligned.sortedByCoord.out.bam" | "output.STAR.1_C11_Hypo_S125_Aligned.sortedByCoord.out.bam" | "output.STAR.1_C12_Hypo_S126_Aligned.sortedByCoord.out.bam" | "output.STAR.1_C1_Norm_S97_Aligned.sortedByCoord.out.bam" | "output.STAR.1_C2_Norm_S98_Aligned.sortedByCoord.out.bam" | "output.STAR.1_C3_Norm_S99_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 | 22934.000000 |
| mean | 0.353154 | 0.657661 | 2.409594 | 2.272319 | 2.190513 | 1.084944 | 1.055311 | 1.243341 | 1.400531 | 2.471422 | 2.491420 | 1.413575 | 2.181846 | 0.003144 | 1.587593 | 2.186045 | 2.346246 | 1.546102 | 1.735698 | 1.171161 |
| std | 0.832173 | 1.329864 | 3.288693 | 3.029951 | 2.752105 | 1.669870 | 1.661270 | 1.963176 | 2.107611 | 3.047782 | 2.981960 | 2.085261 | 2.797098 | 0.073732 | 2.484960 | 2.904428 | 3.031527 | 2.134695 | 2.420870 | 1.780175 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.000000 | 1.000000 | 5.459432 | 5.044394 | 4.584963 | 2.000000 | 2.000000 | 2.321928 | 2.584963 | 5.044394 | 5.044394 | 2.807355 | 4.584963 | 0.000000 | 3.321928 | 4.700440 | 5.087463 | 3.000000 | 3.459432 | 2.000000 |
| max | 10.308339 | 11.898223 | 15.884075 | 15.774916 | 14.609352 | 12.541339 | 11.826548 | 11.240195 | 13.730046 | 15.469038 | 15.297991 | 14.536369 | 15.042215 | 5.209453 | 15.387378 | 16.118171 | 16.574416 | 11.837628 | 14.464354 | 12.264736 |
Let's normalize the data across cells with scikit-learn's Normalizer transformer:
from sklearn.preprocessing import Normalizer
transformer = Normalizer().fit(df_mcf7_allVars_log2)
df_mcf7_allVars_log2_norm = pd.DataFrame(
transformer.transform(df_mcf7_allVars_log2),
columns=df_mcf7_allVars_log2.columns
)
df_mcf7_allVars_log2_norm.describe().round(2)
| "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | ... | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 | 22934.00 |
| mean | 0.02 | 0.00 | 0.02 | 0.00 | 0.02 | 0.03 | 0.04 | 0.04 | 0.03 | 0.01 | ... | 0.02 | 0.03 | 0.01 | 0.03 | 0.02 | 0.02 | 0.03 | 0.03 | 0.03 | 0.03 |
| std | 0.04 | 0.01 | 0.05 | 0.01 | 0.03 | 0.04 | 0.05 | 0.05 | 0.05 | 0.01 | ... | 0.03 | 0.05 | 0.03 | 0.04 | 0.03 | 0.04 | 0.05 | 0.05 | 0.05 | 0.05 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 75% | 0.05 | 0.00 | 0.05 | 0.00 | 0.04 | 0.06 | 0.06 | 0.06 | 0.06 | 0.01 | ... | 0.04 | 0.06 | 0.00 | 0.05 | 0.05 | 0.04 | 0.06 | 0.06 | 0.06 | 0.06 |
| max | 0.71 | 0.71 | 0.93 | 0.47 | 0.58 | 0.85 | 0.79 | 0.93 | 0.93 | 0.50 | ... | 0.89 | 0.98 | 0.95 | 0.91 | 0.80 | 0.92 | 0.98 | 0.85 | 0.93 | 0.85 |
8 rows × 383 columns
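Normalizer operates row-wise: by default it independently rescales each row of its input to unit L2 norm, which is why the normalized values above are bounded by 1. A minimal check on toy data:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [0.0, 5.0]])
X_norm = Normalizer(norm="l2").fit_transform(X)  # each row divided by its L2 norm

print(X_norm)                           # [[0.6 0.8], [0. 1.]]
print(np.linalg.norm(X_norm, axis=1))   # every row now has unit L2 norm
```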
for i in random_variable_indices:
    sns.displot(
        df_mcf7_allVars_log2_norm,  # plot the normalized dataframe, not the raw log2 one
        x= df_mcf7_allVars_log2_norm.columns.tolist()[i],
        kind="kde"
    )
I am loading the filtered and filtered+normalized datasets to make a comparison as requested:
mcf7_smarts_filtered = pd.read_csv("SmartSeq/MCF7_SmartS_Filtered_Data.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_smarts_filtered))
print("First column: ", mcf7_smarts_filtered.iloc[ : , 0])
mcf7_smarts_filtered_normalized = pd.read_csv("SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_smarts_filtered_normalized))
print("First column: ", mcf7_smarts_filtered_normalized.iloc[ : , 0])
Dataframe dimensions: (18945, 313)
First column: "WASH7P" 0
"MIR6859-1" 0
"WASH9P" 1
"OR4F29" 0
"MTND1P23" 0
...
"MT-TE" 4
"MT-CYB" 270
"MT-TT" 0
"MT-TP" 5
"MAFIP" 8
Name: "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam", Length: 18945, dtype: int64
Dataframe dimensions: (3000, 250)
First column: "CYP1B1" 343
"CYP1B1-AS1" 140
"CYP1A1" 0
"NDRG1" 0
"DDIT4" 386
...
"GRIK5" 0
"SLC25A27" 0
"DENND5A" 51
"CDK5R1" 0
"FAM13A-AS1" 0
Name: "output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam", Length: 3000, dtype: int64
for i in random_variable_indices:
print(i)
if i<mcf7_smarts_filtered.shape[1]:
sns.displot(
mcf7_smarts_filtered ,
x= mcf7_smarts_filtered.columns.tolist()[i],
kind="kde"
)
108 161 252 99 203 213 315 86 322 99
The samples in the filtered dataset have the same distribution shape:
for i in random_variable_indices:
if i<mcf7_smarts_filtered_normalized.shape[1]:
sns.displot(
mcf7_smarts_filtered_normalized,
x= mcf7_smarts_filtered_normalized.columns.tolist()[i],
kind="kde"
)
colN_filtered_normalized = mcf7_smarts_filtered_normalized.shape[1]
colN_filtered_normalized
# skewness per cell in the filtered + normalized dataframe
list_skew_cells_filtered_normalized = [skew(mcf7_smarts_filtered_normalized[col]) for col in mcf7_smarts_filtered_normalized.columns]
list_skew_cells_filtered_normalized
sns.histplot(list_skew_cells_filtered_normalized,bins=100)
plt.xlabel('Skewness of single cells expression profiles - filtered & normalized df')
Text(0.5, 0, 'Skewness of single cells expression profiles - filtered & normalized df')
colN_filtered = mcf7_smarts_filtered.shape[1]
colN_filtered
# skewness per cell in the filtered dataframe
list_skew_cells_filtered = [skew(mcf7_smarts_filtered[col]) for col in mcf7_smarts_filtered.columns]
list_skew_cells_filtered
sns.histplot(list_skew_cells_filtered,bins=100)
plt.xlabel('Skewness of single cells expression profiles - filtered df')
Text(0.5, 0, 'Skewness of single cells expression profiles - filtered df')
mcf7_smarts_filtered[['"output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam"']].describe().round(2)
| "output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam" | |
|---|---|
| count | 18945.00 |
| mean | 84.70 |
| std | 730.31 |
| min | 0.00 |
| 25% | 0.00 |
| 50% | 2.00 |
| 75% | 51.00 |
| max | 64491.00 |
mcf7_smarts_filtered_normalized[['"output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam"']].describe().round(2)
| "output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam" | |
|---|---|
| count | 3000.00 |
| mean | 74.14 |
| std | 345.01 |
| min | 0.00 |
| 25% | 0.00 |
| 50% | 0.00 |
| 75% | 24.00 |
| max | 8222.00 |
The normalized dataset has a smaller standard deviation; its range of values is narrower.
duplicate_rows_df_mcf7_allVars_log2 = df_mcf7_allVars_log2[df_mcf7_allVars_log2.duplicated(keep=False)]
print("number of duplicate rows: ", duplicate_rows_df_mcf7_allVars_log2.shape)
print("duplicate rows: ", duplicate_rows_df_mcf7_allVars_log2)
number of duplicate rows: (56, 383)
number of duplicate rows: "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" \
"SHISAL2A" 0.0
"IL12RB2" 0.0
"S1PR1" 0.0
"CD84" 0.0
"GNLY" 0.0
"FAR2P3" 0.0
"KLF2P3" 0.0
"PABPC1P2" 0.0
"UGT1A8" 0.0
"UGT1A9" 0.0
"SLC22A14" 0.0
"COQ10BP2" 0.0
"PANDAR" 0.0
"LAP3P2" 0.0
"RPL22P16" 0.0
"GALNT17" 0.0
"PON1" 0.0
"HTR5A" 0.0
"SNORA36A" 0.0
"MIR664B" 0.0
"CSMD1" 0.0
"KCNS2" 0.0
"MIR548AA1" 0.0
"MIR548D1" 0.0
"MTCO2P11" 0.0
"CLCN3P1" 0.0
"SUGT1P4-STRA6LP" 0.0
"STRA6LP" 0.0
"MUC6" 0.0
"VSTM4" 0.0
"LINC00856" 0.0
"LINC00595" 0.0
"CACYBPP1" 0.0
"LINC00477" 0.0
"KNOP1P1" 0.0
"WDR95P" 0.0
"MIR20A" 0.0
"MIR19B1" 0.0
"RPL21P5" 0.0
"RNU6-539P" 0.0
"SNRPN" 0.0
"SNURF" 0.0
"RBFOX1" 0.0
"LINC02183" 0.0
"MT1M" 0.0
"ASPA" 0.0
"BCL6B" 0.0
"CCL3L3" 0.0
"CCL3L1" 0.0
"OTOP3" 0.0
"RNA5SP450" 0.0
"PSG1" 0.0
"MIR3190" 0.0
"MIR3191" 0.0
"SEZ6L" 0.0
"ADAMTS5" 0.0
... (output truncated: the same 56 duplicated gene rows repeat for each of the remaining cell columns) ...
[56 rows x 383 columns]
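Since duplicated rows carry no extra information, one option (a sketch on a hypothetical toy frame, not necessarily the choice made in this study) is to keep only the first occurrence of each duplicated expression profile with pandas `drop_duplicates`:

```python
import pandas as pd

# Toy frame: GENE_A and GENE_B share an identical profile across the two cells
df = pd.DataFrame(
    {"cell_1": [0.0, 0.0, 2.0], "cell_2": [1.0, 1.0, 0.0]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

deduped = df.drop_duplicates(keep="first")  # drops GENE_B, keeps GENE_A
print(deduped.index.tolist())  # ['GENE_A', 'GENE_C']
```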
To understand which genes convey the same information, we can check their correlations.
#print("names of duplicate rows: ",duplicate_rows_df.index)
duplicate_rows_df_mcf7_allVars_log2_t = duplicate_rows_df_mcf7_allVars_log2.T
duplicate_rows_df_mcf7_allVars_log2_t
c_dupl = duplicate_rows_df_mcf7_allVars_log2_t.corr()
c_dupl
| "SHISAL2A" | "IL12RB2" | "S1PR1" | "CD84" | "GNLY" | "FAR2P3" | "KLF2P3" | "PABPC1P2" | "UGT1A8" | "UGT1A9" | ... | "BCL6B" | "CCL3L3" | "CCL3L1" | "OTOP3" | "RNA5SP450" | "PSG1" | "MIR3190" | "MIR3191" | "SEZ6L" | "ADAMTS5" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "SHISAL2A" | 1.000000 | 0.595969 | 0.600789 | 0.374125 | 0.497375 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.251552 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.233664 | 0.595969 |
| "IL12RB2" | 0.595969 | 1.000000 | 0.719609 | 0.902085 | 0.595969 | -0.008126 | -0.008126 | 0.975214 | -0.013187 | -0.013187 | ... | 0.595969 | -0.011407 | -0.011407 | 0.801883 | -0.005119 | 0.595969 | -0.005119 | -0.005119 | 0.785477 | 0.713849 |
| "S1PR1" | 0.600789 | 0.719609 | 1.000000 | 0.452186 | 0.600789 | -0.008102 | -0.008102 | 0.600789 | -0.013148 | -0.013148 | ... | 0.600789 | -0.011372 | -0.011372 | 0.304354 | -0.005104 | 0.600789 | -0.005104 | -0.005104 | 0.282777 | 0.719609 |
| "CD84" | 0.374125 | 0.902085 | 0.452186 | 1.000000 | 0.374125 | -0.008126 | -0.008126 | 0.975214 | -0.013187 | -0.013187 | ... | 0.374125 | -0.011407 | -0.011407 | 0.981215 | -0.005119 | 0.374125 | -0.005119 | -0.005119 | 0.975655 | 0.448546 |
| "GNLY" | 0.497375 | 0.595969 | 0.600789 | 0.374125 | 1.000000 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 1.000000 | 0.113449 | 0.113449 | 0.251552 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.233664 | 0.975214 |
| "FAR2P3" | -0.008333 | -0.008126 | -0.008102 | -0.008126 | -0.008333 | 1.000000 | 1.000000 | -0.008333 | -0.021465 | -0.021465 | ... | -0.008333 | -0.018567 | -0.018567 | -0.007618 | -0.008333 | -0.008333 | -0.008333 | -0.008333 | -0.007524 | -0.008126 |
| "KLF2P3" | -0.008333 | -0.008126 | -0.008102 | -0.008126 | -0.008333 | 1.000000 | 1.000000 | -0.008333 | -0.021465 | -0.021465 | ... | -0.008333 | -0.018567 | -0.018567 | -0.007618 | -0.008333 | -0.008333 | -0.008333 | -0.008333 | -0.007524 | -0.008126 |
| "PABPC1P2" | 0.497375 | 0.975214 | 0.600789 | 0.975214 | 0.497375 | -0.008333 | -0.008333 | 1.000000 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.914209 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.902946 | 0.595969 |
| "UGT1A8" | -0.013522 | -0.013187 | -0.013148 | -0.013187 | -0.013522 | -0.021465 | -0.021465 | -0.013522 | 1.000000 | 1.000000 | ... | -0.013522 | -0.030130 | -0.030130 | -0.012362 | -0.013522 | -0.013522 | -0.013522 | -0.013522 | -0.012210 | -0.013187 |
| "UGT1A9" | -0.013522 | -0.013187 | -0.013148 | -0.013187 | -0.013522 | -0.021465 | -0.021465 | -0.013522 | 1.000000 | 1.000000 | ... | -0.013522 | -0.030130 | -0.030130 | -0.012362 | -0.013522 | -0.013522 | -0.013522 | -0.013522 | -0.012210 | -0.013187 |
| "SLC22A14" | 0.497375 | 0.975214 | 0.600789 | 0.975214 | 0.497375 | -0.008333 | -0.008333 | 1.000000 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.914209 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.902946 | 0.595969 |
| "COQ10BP2" | 1.000000 | 0.595969 | 0.600789 | 0.374125 | 0.497375 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.251552 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.233664 | 0.595969 |
| "PANDAR" | -0.020348 | -0.019843 | -0.019784 | -0.019843 | -0.020348 | -0.032300 | -0.032300 | -0.020348 | 0.001888 | 0.001888 | ... | -0.020348 | 0.003298 | 0.003298 | -0.018602 | -0.020348 | -0.020348 | 0.118817 | 0.118817 | -0.018373 | -0.019843 |
| "LAP3P2" | -0.020348 | -0.019843 | -0.019784 | -0.019843 | -0.020348 | -0.032300 | -0.032300 | -0.020348 | 0.001888 | 0.001888 | ... | -0.020348 | 0.003298 | 0.003298 | -0.018602 | -0.020348 | -0.020348 | 0.118817 | 0.118817 | -0.018373 | -0.019843 |
| "RPL22P16" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | 1.000000 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "GALNT17" | 0.595969 | 1.000000 | 0.719609 | 0.902085 | 0.595969 | -0.008126 | -0.008126 | 0.975214 | -0.013187 | -0.013187 | ... | 0.595969 | -0.011407 | -0.011407 | 0.801883 | -0.005119 | 0.595969 | -0.005119 | -0.005119 | 0.785477 | 0.713849 |
| "PON1" | 0.595969 | 1.000000 | 0.719609 | 0.902085 | 0.595969 | -0.008126 | -0.008126 | 0.975214 | -0.013187 | -0.013187 | ... | 0.595969 | -0.011407 | -0.011407 | 0.801883 | -0.005119 | 0.595969 | -0.005119 | -0.005119 | 0.785477 | 0.713849 |
| "HTR5A" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "SNORA36A" | -0.005593 | -0.005455 | -0.005438 | -0.005455 | -0.005593 | -0.008879 | -0.008879 | -0.005593 | -0.014408 | -0.014408 | ... | -0.005593 | 0.086725 | 0.086725 | -0.005113 | -0.005593 | -0.005593 | -0.005593 | -0.005593 | -0.005050 | -0.005455 |
| "MIR664B" | -0.005593 | -0.005455 | -0.005438 | -0.005455 | -0.005593 | -0.008879 | -0.008879 | -0.005593 | -0.014408 | -0.014408 | ... | -0.005593 | 0.086725 | 0.086725 | -0.005113 | -0.005593 | -0.005593 | -0.005593 | -0.005593 | -0.005050 | -0.005455 |
| "CSMD1" | 0.233664 | 0.785477 | 0.282777 | 0.975655 | 0.233664 | -0.007524 | -0.007524 | 0.902946 | -0.012210 | -0.012210 | ... | 0.233664 | -0.010561 | -0.010561 | 0.999636 | -0.004740 | 0.233664 | -0.004740 | -0.004740 | 1.000000 | 0.280484 |
| "KCNS2" | 0.497375 | 0.595969 | 0.600789 | 0.374125 | 1.000000 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 1.000000 | 0.113449 | 0.113449 | 0.251552 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.233664 | 0.975214 |
| "MIR548AA1" | -0.005119 | -0.004992 | -0.004977 | -0.004992 | -0.005119 | -0.008126 | -0.008126 | -0.005119 | -0.013187 | -0.013187 | ... | -0.005119 | -0.011407 | -0.011407 | -0.004680 | -0.005119 | -0.005119 | -0.005119 | -0.005119 | -0.004622 | -0.004992 |
| "MIR548D1" | -0.005119 | -0.004992 | -0.004977 | -0.004992 | -0.005119 | -0.008126 | -0.008126 | -0.005119 | -0.013187 | -0.013187 | ... | -0.005119 | -0.011407 | -0.011407 | -0.004680 | -0.005119 | -0.005119 | -0.005119 | -0.005119 | -0.004622 | -0.004992 |
| "MTCO2P11" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "CLCN3P1" | 0.251552 | 0.801883 | 0.304354 | 0.981215 | 0.251552 | -0.007618 | -0.007618 | 0.914209 | -0.012362 | -0.012362 | ... | 0.251552 | -0.010693 | -0.010693 | 1.000000 | -0.004799 | 0.251552 | -0.004799 | -0.004799 | 0.999636 | 0.301890 |
| "SUGT1P4-STRA6LP" | 0.037564 | -0.030417 | -0.030325 | -0.030417 | -0.031190 | 0.054004 | 0.054004 | -0.031190 | 0.094395 | 0.094395 | ... | -0.031190 | 0.081243 | 0.081243 | -0.028514 | -0.031190 | -0.031190 | 0.093831 | 0.093831 | -0.028163 | -0.030417 |
| "STRA6LP" | 0.037564 | -0.030417 | -0.030325 | -0.030417 | -0.031190 | 0.054004 | 0.054004 | -0.031190 | 0.094395 | 0.094395 | ... | -0.031190 | 0.081243 | 0.081243 | -0.028514 | -0.031190 | -0.031190 | 0.093831 | 0.093831 | -0.028163 | -0.030417 |
| "MUC6" | 0.600789 | 0.719609 | 1.000000 | 0.452186 | 0.600789 | -0.008102 | -0.008102 | 0.600789 | -0.013148 | -0.013148 | ... | 0.600789 | -0.011372 | -0.011372 | 0.304354 | -0.005104 | 0.600789 | -0.005104 | -0.005104 | 0.282777 | 0.719609 |
| "VSTM4" | 0.497375 | 0.595969 | 0.600789 | 0.374125 | 0.497375 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.251552 | -0.005249 | 1.000000 | -0.005249 | -0.005249 | 0.233664 | 0.595969 |
| "LINC00856" | -0.008950 | -0.008728 | 0.091960 | -0.008728 | -0.008950 | -0.014208 | -0.014208 | -0.008950 | -0.023056 | -0.023056 | ... | -0.008950 | -0.019943 | -0.019943 | -0.008182 | -0.008950 | -0.008950 | -0.008950 | -0.008950 | -0.008082 | -0.008728 |
| "LINC00595" | -0.008950 | -0.008728 | 0.091960 | -0.008728 | -0.008950 | -0.014208 | -0.014208 | -0.008950 | -0.023056 | -0.023056 | ... | -0.008950 | -0.019943 | -0.019943 | -0.008182 | -0.008950 | -0.008950 | -0.008950 | -0.008950 | -0.008082 | -0.008728 |
| "CACYBPP1" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "LINC00477" | -0.007266 | -0.007086 | -0.007065 | -0.007086 | -0.007266 | -0.011534 | -0.011534 | -0.007266 | -0.018717 | -0.018717 | ... | -0.007266 | 0.103574 | 0.103574 | -0.006643 | -0.007266 | -0.007266 | -0.007266 | -0.007266 | -0.006561 | -0.007086 |
| "KNOP1P1" | -0.007266 | -0.007086 | -0.007065 | -0.007086 | -0.007266 | -0.011534 | -0.011534 | -0.007266 | -0.018717 | -0.018717 | ... | -0.007266 | 0.103574 | 0.103574 | -0.006643 | -0.007266 | -0.007266 | -0.007266 | -0.007266 | -0.006561 | -0.007086 |
| "WDR95P" | 0.374125 | 0.902085 | 0.452186 | 1.000000 | 0.374125 | -0.008126 | -0.008126 | 0.975214 | -0.013187 | -0.013187 | ... | 0.374125 | -0.011407 | -0.011407 | 0.981215 | -0.005119 | 0.374125 | -0.005119 | -0.005119 | 0.975655 | 0.448546 |
| "MIR20A" | -0.005119 | -0.004992 | -0.004977 | -0.004992 | -0.005119 | -0.008126 | -0.008126 | -0.005119 | -0.013187 | -0.013187 | ... | -0.005119 | -0.011407 | -0.011407 | -0.004680 | -0.005119 | -0.005119 | -0.005119 | -0.005119 | -0.004622 | -0.004992 |
| "MIR19B1" | -0.005119 | -0.004992 | -0.004977 | -0.004992 | -0.005119 | -0.008126 | -0.008126 | -0.005119 | -0.013187 | -0.013187 | ... | -0.005119 | -0.011407 | -0.011407 | -0.004680 | -0.005119 | -0.005119 | -0.005119 | -0.005119 | -0.004622 | -0.004992 |
| "RPL21P5" | 0.497375 | 0.595969 | 0.600789 | 0.374125 | 0.497375 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.251552 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.233664 | 0.595969 |
| "RNU6-539P" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "SNRPN" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "SNURF" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "RBFOX1" | 0.497375 | 0.595969 | 0.600789 | 0.374125 | 0.497375 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.251552 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.233664 | 0.595969 |
| "LINC02183" | 0.595969 | 0.713849 | 0.719609 | 0.448546 | 0.975214 | -0.008126 | -0.008126 | 0.595969 | -0.013187 | -0.013187 | ... | 0.975214 | 0.083019 | 0.083019 | 0.301890 | -0.005119 | 0.595969 | -0.005119 | -0.005119 | 0.280484 | 1.000000 |
| "MT1M" | 0.595969 | 0.713849 | 0.719609 | 0.448546 | 0.595969 | -0.008126 | -0.008126 | 0.595969 | -0.013187 | -0.013187 | ... | 0.595969 | -0.011407 | -0.011407 | 0.301890 | -0.005119 | 0.595969 | -0.005119 | -0.005119 | 0.280484 | 0.713849 |
| "ASPA" | 0.595969 | 0.713849 | 0.719609 | 0.448546 | 0.595969 | -0.008126 | -0.008126 | 0.595969 | -0.013187 | -0.013187 | ... | 0.595969 | -0.011407 | -0.011407 | 0.301890 | -0.005119 | 0.595969 | -0.005119 | -0.005119 | 0.280484 | 0.713849 |
| "BCL6B" | 0.497375 | 0.595969 | 0.600789 | 0.374125 | 1.000000 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 1.000000 | 0.113449 | 0.113449 | 0.251552 | -0.005249 | 0.497375 | -0.005249 | -0.005249 | 0.233664 | 0.975214 |
| "CCL3L3" | -0.011697 | -0.011407 | -0.011372 | -0.011407 | 0.113449 | -0.018567 | -0.018567 | -0.011697 | -0.030130 | -0.030130 | ... | 0.113449 | 1.000000 | 1.000000 | -0.010693 | -0.011697 | -0.011697 | -0.011697 | -0.011697 | -0.010561 | 0.083019 |
| "CCL3L1" | -0.011697 | -0.011407 | -0.011372 | -0.011407 | 0.113449 | -0.018567 | -0.018567 | -0.011697 | -0.030130 | -0.030130 | ... | 0.113449 | 1.000000 | 1.000000 | -0.010693 | -0.011697 | -0.011697 | -0.011697 | -0.011697 | -0.010561 | 0.083019 |
| "OTOP3" | 0.251552 | 0.801883 | 0.304354 | 0.981215 | 0.251552 | -0.007618 | -0.007618 | 0.914209 | -0.012362 | -0.012362 | ... | 0.251552 | -0.010693 | -0.010693 | 1.000000 | -0.004799 | 0.251552 | -0.004799 | -0.004799 | 0.999636 | 0.301890 |
| "RNA5SP450" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | 1.000000 | -0.005249 | -0.005249 | -0.005249 | -0.004740 | -0.005119 |
| "PSG1" | 0.497375 | 0.595969 | 0.600789 | 0.374125 | 0.497375 | -0.008333 | -0.008333 | 0.497375 | -0.013522 | -0.013522 | ... | 0.497375 | -0.011697 | -0.011697 | 0.251552 | -0.005249 | 1.000000 | -0.005249 | -0.005249 | 0.233664 | 0.595969 |
| "MIR3190" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | 1.000000 | 1.000000 | -0.004740 | -0.005119 |
| "MIR3191" | -0.005249 | -0.005119 | -0.005104 | -0.005119 | -0.005249 | -0.008333 | -0.008333 | -0.005249 | -0.013522 | -0.013522 | ... | -0.005249 | -0.011697 | -0.011697 | -0.004799 | -0.005249 | -0.005249 | 1.000000 | 1.000000 | -0.004740 | -0.005119 |
| "SEZ6L" | 0.233664 | 0.785477 | 0.282777 | 0.975655 | 0.233664 | -0.007524 | -0.007524 | 0.902946 | -0.012210 | -0.012210 | ... | 0.233664 | -0.010561 | -0.010561 | 0.999636 | -0.004740 | 0.233664 | -0.004740 | -0.004740 | 1.000000 | 0.280484 |
| "ADAMTS5" | 0.595969 | 0.713849 | 0.719609 | 0.448546 | 0.975214 | -0.008126 | -0.008126 | 0.595969 | -0.013187 | -0.013187 | ... | 0.975214 | 0.083019 | 0.083019 | 0.301890 | -0.005119 | 0.595969 | -0.005119 | -0.005119 | 0.280484 | 1.000000 |
56 rows × 56 columns
We create the dataset without duplicate rows:
df_mcf7_allVars_log2_noDup = df_mcf7_allVars_log2.drop_duplicates()
#df_noDup
100*(len(df_mcf7_allVars_log2)- len(df_mcf7_allVars_log2_noDup))/len(df_mcf7_allVars_log2)
0.12644981250545043
We removed less than 1% of the dataset (about 0.13% of the rows).
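As a minimal sketch of this step (using a toy DataFrame with made-up gene and cell names, not the notebook's data), `drop_duplicates` keeps the first occurrence of each repeated row:

```python
import pandas as pd

# Toy expression matrix: rows are genes, columns are cells;
# "GENE_C" is an exact duplicate of "GENE_A".
df = pd.DataFrame(
    {"cell1": [0, 1, 0], "cell2": [2, 3, 2]},
    index=["GENE_A", "GENE_B", "GENE_C"],
)

df_no_dup = df.drop_duplicates()  # drops "GENE_C", keeps the first copy
removed_pct = 100 * (len(df) - len(df_no_dup)) / len(df)
print(df_no_dup.index.tolist(), removed_pct)
```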
We are investigating the correlations between the samples (i.e. the correlation between gene expression profiles of different cells):
plt.figure(figsize=(10,5))
#df_small = df.iloc[:, :50]
#c= df_small.corr()
c = df_mcf7_allVars_log2_noDup.corr()  # cell-by-cell correlation matrix
midpoint = (c.values.max() - c.values.min()) / 2 + c.values.min()
#sns.heatmap(c,cmap='coolwarm',annot=True, center=midpoint )
sns.heatmap(c, cmap='coolwarm', center=0)
print("Number of cells included: ", np.shape(c))
print("Midpoint of the correlation range between cells: ", midpoint)
print("Min. correlation of expression profiles between cells: ", c.values.min())
Number of cells included: (383, 383) Midpoint of the correlation range between cells: 0.4970002710134507 Min. correlation of expression profiles between cells: -0.005999457973098509
We see that the correlation matrix of the cells contains mostly high values and is therefore mostly red. There are some white stripes that indicate cells which are poorly correlated with the other cells.
For each cell we count how many cells it has low correlation with, where low correlation is defined as a value between -0.2 and +0.2:
df_lowCorr_info = c[(c < 0.2) & (c>-0.2)].count().reset_index().rename(columns={'index':'cell', 0:'n_lowCorr_cells'})
df_lowCorr_info
| cell | n_lowCorr_cells | |
|---|---|---|
| 0 | "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCo... | 14 |
| 1 | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCo... | 382 |
| 2 | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCo... | 14 |
| 3 | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoor... | 13 |
| 4 | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoor... | 13 |
| ... | ... | ... |
| 378 | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCo... | 14 |
| 379 | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCo... | 14 |
| 380 | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCo... | 14 |
| 381 | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCo... | 14 |
| 382 | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCo... | 14 |
383 rows × 2 columns
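The counting step can be illustrated on a tiny synthetic correlation matrix (the cell names are hypothetical): the boolean mask keeps only the entries with |r| < 0.2, turning the rest into NaN, and `count()` then counts the surviving entries per column.

```python
import pandas as pd

# Tiny symmetric correlation matrix for three hypothetical cells.
c_toy = pd.DataFrame(
    [[1.00, 0.05, 0.80],
     [0.05, 1.00, 0.10],
     [0.80, 0.10, 1.00]],
    index=["cellA", "cellB", "cellC"],
    columns=["cellA", "cellB", "cellC"],
)

# Entries outside (-0.2, 0.2) become NaN; count() ignores NaN.
low_counts = c_toy[(c_toy < 0.2) & (c_toy > -0.2)].count()
print(low_counts.tolist())  # cellB has two low-correlation partners
```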
Let's define the 'uncorrelated cell group' as the cells that have low correlation with more than half of the other cells:
df_lowCorr_info[df_lowCorr_info['n_lowCorr_cells']> 383/2]
| cell | n_lowCorr_cells | |
|---|---|---|
| 1 | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCo... | 382 |
| 23 | "output.STAR.1_B9_Hypo_S75_Aligned.sortedByCoo... | 379 |
| 46 | "output.STAR.1_D8_Hypo_S170_Aligned.sortedByCo... | 378 |
| 51 | "output.STAR.1_E1_Norm_S193_Aligned.sortedByCo... | 340 |
| 60 | "output.STAR.1_F10_Hypo_S268_Aligned.sortedByC... | 364 |
| 74 | "output.STAR.1_G12_Hypo_S318_Aligned.sortedByC... | 382 |
| 87 | "output.STAR.1_H1_Norm_S337_Aligned.sortedByCo... | 382 |
| 91 | "output.STAR.1_H5_Norm_S341_Aligned.sortedByCo... | 382 |
| 118 | "output.STAR.2_B8_Hypo_S80_Aligned.sortedByCoo... | 378 |
| 142 | "output.STAR.2_D8_Hypo_S176_Aligned.sortedByCo... | 382 |
| 240 | "output.STAR.3_E10_Hypo_S232_Aligned.sortedByC... | 375 |
| 245 | "output.STAR.3_E3_Norm_S207_Aligned.sortedByCo... | 381 |
| 249 | "output.STAR.3_E7_Hypo_S229_Aligned.sortedByCo... | 380 |
| 295 | "output.STAR.4_A5_Norm_S23_Aligned.sortedByCoo... | 382 |
print(len(df_lowCorr_info[df_lowCorr_info['n_lowCorr_cells']> 383/2]))
14
14 cells express gene profiles very different from the rest (i.e. their correlations with almost all other cells lie between -0.2 and 0.2).
# keep only the cells with low correlation to more than half of the other cells
uncorrelated_cells = df_lowCorr_info[df_lowCorr_info['n_lowCorr_cells'] > 383/2].cell.tolist()
df_mcf7_allVars_log2_noDup[uncorrelated_cells].describe(percentiles=[0.05,0.25,0.5,0.75,0.95]).round(2)
| "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | ... | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 | 22905.00 |
| mean | 1.89 | 0.01 | 1.73 | 0.41 | 1.57 | 2.18 | 2.54 | 2.60 | 2.51 | 0.63 | ... | 1.66 | 2.37 | 0.51 | 1.97 | 1.75 | 1.63 | 2.15 | 2.22 | 2.37 | 2.30 |
| std | 2.74 | 0.12 | 3.06 | 0.93 | 2.16 | 2.94 | 3.17 | 3.03 | 3.11 | 1.18 | ... | 2.21 | 2.86 | 1.88 | 2.66 | 2.43 | 2.24 | 2.94 | 3.00 | 3.10 | 2.99 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 5% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 75% | 4.17 | 0.00 | 2.58 | 0.00 | 3.00 | 4.58 | 5.32 | 5.17 | 5.29 | 1.00 | ... | 3.32 | 4.95 | 0.00 | 4.17 | 3.70 | 3.32 | 4.81 | 4.95 | 5.29 | 5.09 |
| 95% | 7.24 | 0.00 | 8.26 | 2.58 | 6.00 | 7.93 | 8.41 | 8.09 | 8.21 | 3.32 | ... | 6.00 | 7.47 | 5.73 | 7.11 | 6.58 | 6.11 | 7.67 | 7.79 | 8.01 | 7.73 |
| max | 15.51 | 3.91 | 16.32 | 8.18 | 13.37 | 15.52 | 14.85 | 15.64 | 15.15 | 10.74 | ... | 14.12 | 14.51 | 16.32 | 14.85 | 13.57 | 14.24 | 14.77 | 15.31 | 15.50 | 16.07 |
10 rows × 383 columns
The cells in the low-correlation group have 0 in at least half of their data points; some even have 0 in 3/4 of their distribution.
We can also look in the same way at the cells that are highly correlated with other cells. We define high correlation as an absolute value greater than 0.75 (i.e. a correlation above 0.75 or below -0.75):
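This zero-heaviness can also be quantified directly as the fraction of zero entries per cell. A sketch with synthetic data (the column names are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Toy genes-x-cells matrix: "sparse_cell" is ~80% zeros, "dense_cell" has none.
toy = pd.DataFrame({
    "dense_cell": rng.integers(1, 10, size=100),
    "sparse_cell": np.where(rng.random(100) < 0.8, 0, 5),
})

zero_frac = (toy == 0).mean()  # column-wise fraction of zero entries
print(zero_frac)
```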
df_highCorr_info = c[(c < -0.75) | (c> 0.75)].count().reset_index().rename(columns={'index':'cell', 0:'n_highCorr_cells'})
print(len(df_highCorr_info[df_highCorr_info['n_highCorr_cells']> 383/2]))
151
151 of the 383 cells (about 40%) are highly correlated with more than half of the other cells.
df_highCorr_info.sort_values(by='n_highCorr_cells', ascending=False).head(5)
| cell | n_highCorr_cells | |
|---|---|---|
| 373 | "output.STAR.4_H14_Hypo_S383_Aligned.sortedByC... | 288 |
| 132 | "output.STAR.2_D10_Hypo_S178_Aligned.sortedByC... | 284 |
| 338 | "output.STAR.4_E12_Hypo_S240_Aligned.sortedByC... | 282 |
| 116 | "output.STAR.2_B6_Norm_S60_Aligned.sortedByCoo... | 277 |
| 130 | "output.STAR.2_C8_Hypo_S128_Aligned.sortedByCo... | 274 |
The five cells above are highly correlated with the largest numbers of other cells (up to 288 out of 383).
c
| "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam" | "output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H2_Norm_S356_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | 1.000000 | 0.092392 | 0.641451 | 0.552030 | 0.740154 | 0.715509 | 0.698258 | 0.733852 | 0.722427 | 0.704852 | ... | 0.802480 | 0.704740 | 0.354362 | 0.706341 | 0.717634 | 0.714914 | 0.689291 | 0.735592 | 0.758200 | 0.772503 |
| "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | 0.092392 | 1.000000 | 0.077445 | 0.093890 | 0.086249 | 0.078653 | 0.078769 | 0.077744 | 0.079648 | 0.110500 | ... | 0.095141 | 0.078520 | 0.076843 | 0.090443 | 0.084460 | 0.089407 | 0.079306 | 0.084042 | 0.093441 | 0.095487 |
| "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | 0.641451 | 0.077445 | 1.000000 | 0.491788 | 0.591907 | 0.602307 | 0.591729 | 0.602359 | 0.609478 | 0.576598 | ... | 0.660210 | 0.616690 | 0.344620 | 0.617572 | 0.621140 | 0.618269 | 0.604898 | 0.673522 | 0.643244 | 0.646136 |
| "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | 0.552030 | 0.093890 | 0.491788 | 1.000000 | 0.627502 | 0.596699 | 0.582033 | 0.589526 | 0.601294 | 0.616554 | ... | 0.604512 | 0.576742 | 0.418460 | 0.589078 | 0.627158 | 0.622275 | 0.575052 | 0.504048 | 0.527658 | 0.521477 |
| "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | 0.740154 | 0.086249 | 0.591907 | 0.627502 | 1.000000 | 0.823429 | 0.823665 | 0.860464 | 0.798215 | 0.736518 | ... | 0.814914 | 0.770448 | 0.416724 | 0.799517 | 0.794924 | 0.800594 | 0.747523 | 0.682869 | 0.721651 | 0.716155 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| "output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" | 0.714914 | 0.089407 | 0.618269 | 0.622275 | 0.800594 | 0.809851 | 0.787271 | 0.810367 | 0.780386 | 0.700814 | ... | 0.789713 | 0.792814 | 0.418525 | 0.799382 | 0.778004 | 1.000000 | 0.734664 | 0.699840 | 0.713377 | 0.704496 |
| "output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" | 0.689291 | 0.079306 | 0.604898 | 0.575052 | 0.747523 | 0.742690 | 0.736827 | 0.762113 | 0.749664 | 0.647589 | ... | 0.742599 | 0.735169 | 0.398002 | 0.760377 | 0.754347 | 0.734664 | 1.000000 | 0.685047 | 0.717097 | 0.700175 |
| "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" | 0.735592 | 0.084042 | 0.673522 | 0.504048 | 0.682869 | 0.698498 | 0.690364 | 0.713558 | 0.713611 | 0.641984 | ... | 0.766088 | 0.719936 | 0.329851 | 0.708870 | 0.691307 | 0.699840 | 0.685047 | 1.000000 | 0.763240 | 0.776673 |
| "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" | 0.758200 | 0.093441 | 0.643244 | 0.527658 | 0.721651 | 0.711792 | 0.708398 | 0.748183 | 0.732998 | 0.674184 | ... | 0.789730 | 0.728287 | 0.335161 | 0.727184 | 0.715503 | 0.713377 | 0.717097 | 0.763240 | 1.000000 | 0.789785 |
| "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" | 0.772503 | 0.095487 | 0.646136 | 0.521477 | 0.716155 | 0.704285 | 0.702876 | 0.740780 | 0.735109 | 0.677433 | ... | 0.808588 | 0.724120 | 0.330594 | 0.714528 | 0.711640 | 0.704496 | 0.700175 | 0.776673 | 0.789785 | 1.000000 |
383 rows × 383 columns
Let's look at the correlations between hypoxia cells:
hypo_cells = [elem for elem in df_mcf7_allVars_log2_noDup.columns.tolist() if 'Hypo' in elem ]
df_corr_hypo_cells = c[c.index.isin(hypo_cells)]
df_corr_hypo_cells = df_corr_hypo_cells[hypo_cells]
midpoint_hypo = (df_corr_hypo_cells.values.max() - df_corr_hypo_cells.values.min()) /2 + df_corr_hypo_cells.values.min()
print("Number of cells included: ", np.shape(df_corr_hypo_cells))
print("Midpoint of the correlation range between hypoxia cells: ", midpoint_hypo)
#df_mcf7_allVars_log2_noDup.corr()
Number of cells included: (191, 191) Midpoint of the correlation range between hypoxia cells: 0.49759528961067045
Let's look at the correlations between normoxia cells:
no_hypo_cells = [elem for elem in df_mcf7_allVars_log2_noDup.columns.tolist() if 'Hypo' not in elem ]
df_corr_nohypo_cells = c[c.index.isin(no_hypo_cells)]
df_corr_nohypo_cells = df_corr_nohypo_cells[no_hypo_cells]
midpoint_nohypo = (df_corr_nohypo_cells.values.max() - df_corr_nohypo_cells.values.min()) /2 + df_corr_nohypo_cells.values.min()
print("Number of cells included: ", np.shape(df_corr_nohypo_cells))
print("Midpoint of the correlation range between normoxia cells: ", midpoint_nohypo)
#df_mcf7_allVars_log2_noDup.corr()
Number of cells included: (192, 192) Midpoint of the correlation range between normoxia cells: 0.4986955090694124
The correlation midpoints within the two cell groups (low-oxygen and normal-oxygen cells) are very similar. That means normoxia cells are not more similar to each other than hypoxia cells are to each other.
We choose 5 random cells from the normal oxygen condition and 5 random cells from the low oxygen condition, and look at the distributions of their correlations with the other cells in the same condition:
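Note that the `midpoint` quantity computed above is the midrange of the correlation values, (max + min) / 2, which is not the same as the mean correlation; on skewed distributions the two can differ noticeably. A toy illustration:

```python
import numpy as np

r = np.array([-0.1, 0.7, 0.8, 0.9])  # hypothetical correlation values

midrange = (r.max() - r.min()) / 2 + r.min()  # centre of the value range
mean_r = r.mean()                             # actual average

print(midrange, mean_r)  # the midrange (0.4) underestimates the mean (0.575)
```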
random.seed(1111)
random_vars = [randint(0,191) for i in range(0,5)]
sns.histplot(df_corr_nohypo_cells.iloc[:,random_vars],bins=100)
plt.ylabel('Frequency')
plt.xlabel('Correlation between cells expression profiles')
Text(0.5, 0, 'Correlation between cells expression profiles')
In the normoxia condition, the random draw picked 4 cells that have high correlations with the other cells, and one cell (in purple) that has lower correlations with the other normoxia cells.
random_vars = [randint(0, 190) for i in range(0, 5)]  # 191 hypoxia cells, so valid column indices are 0-190
sns.histplot(df_corr_hypo_cells.iloc[:,random_vars],bins=100)
plt.ylabel('Frequency')
plt.xlabel('Correlation between cells expression profiles')
Text(0.5, 0, 'Correlation between cells expression profiles')
In the hypoxia condition, the random draw picked 4 cells that have high correlations with the other cells, and one cell (in orange) that has lower correlations with the other hypoxia cells.
df_mcf7_allVars_log2.T.shape
(383, 22934)
We also check the correlations between the features (i.e. the expression levels of different genes), as requested.
Checking all of the features would take too long, so for this exercise we use only the first 20:
df_mcf7_allVars_log2_noDup.iloc[:,0:20].T
| "WASH7P" | "MIR6859-1" | "WASH9P" | "OR4F29" | "MTND1P23" | "MTND2P28" | "MTCO1P12" | "MTCO2P12" | "MTATP8P1" | "MTATP6P1" | ... | "MT-TH" | "MT-TS2" | "MT-TL2" | "MT-ND5" | "MT-ND6" | "MT-TE" | "MT-CYB" | "MT-TT" | "MT-TP" | "MAFIP" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "output.STAR.1_A10_Hypo_S28_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 1.584963 | 1.584963 | 0.000000 | 0.0 | 4.906891 | ... | 0.000000 | 0.0 | 0.000000 | 8.982994 | 7.209453 | 2.321928 | 8.082149 | 0.000000 | 2.584963 | 3.169925 |
| "output.STAR.1_A11_Hypo_S29_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 1.000000 | 1.0 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.1_A12_Hypo_S30_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 1.000000 | 1.000000 | 0.0 | 3.700440 | ... | 0.000000 | 0.0 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 6.266787 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.1_A1_Norm_S1_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 3.000000 | ... | 1.000000 | 0.0 | 0.000000 | 5.491853 | 3.169925 | 0.000000 | 6.066089 | 0.000000 | 1.000000 | 0.000000 |
| "output.STAR.1_A2_Norm_S2_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 6.108524 | ... | 0.000000 | 0.0 | 0.000000 | 7.894818 | 5.000000 | 2.000000 | 9.507795 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.1_A3_Norm_S3_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 1.000000 | 2.000000 | 0.000000 | 0.0 | 7.672425 | ... | 1.000000 | 0.0 | 0.000000 | 9.805744 | 6.832890 | 2.000000 | 11.408330 | 1.000000 | 1.000000 | 0.000000 |
| "output.STAR.1_A4_Norm_S4_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 3.459432 | 0.0 | 0.0 | 2.000000 | 3.459432 | 1.000000 | 0.0 | 9.434628 | ... | 1.000000 | 0.0 | 2.000000 | 10.321928 | 7.044394 | 0.000000 | 13.187197 | 1.000000 | 1.000000 | 0.000000 |
| "output.STAR.1_A5_Norm_S5_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 2.807355 | 2.584963 | 1.584963 | 0.0 | 7.693487 | ... | 0.000000 | 1.0 | 2.321928 | 9.426265 | 6.285402 | 0.000000 | 11.667999 | 1.000000 | 2.321928 | 1.000000 |
| "output.STAR.1_A6_Norm_S6_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 4.087463 | 2.584963 | 1.000000 | 1.0 | 9.583083 | ... | 1.584963 | 2.0 | 3.321928 | 11.359750 | 8.625709 | 3.906891 | 12.956739 | 2.321928 | 1.584963 | 0.000000 |
| "output.STAR.1_A7_Hypo_S25_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 2.321928 | ... | 0.000000 | 0.0 | 0.000000 | 5.247928 | 3.700440 | 1.000000 | 4.954196 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.1_A8_Hypo_S26_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 2.000000 | 1.584963 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.1_A9_Hypo_S27_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 3.321928 | ... | 0.000000 | 0.0 | 0.000000 | 5.727920 | 3.584963 | 1.584963 | 5.832890 | 0.000000 | 1.000000 | 0.000000 |
| "output.STAR.1_B10_Hypo_S76_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 2.584963 | 0.0 | 0.0 | 4.000000 | 1.000000 | 0.000000 | 0.0 | 8.375039 | ... | 3.000000 | 1.0 | 2.000000 | 10.424166 | 7.930737 | 3.321928 | 11.478770 | 1.584963 | 2.584963 | 0.000000 |
| "output.STAR.1_B11_Hypo_S77_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 2.321928 | 0.0 | 0.0 | 2.000000 | 2.584963 | 0.000000 | 0.0 | 5.554589 | ... | 0.000000 | 0.0 | 1.000000 | 9.857981 | 8.076816 | 2.000000 | 9.157347 | 2.321928 | 3.000000 | 0.000000 |
| "output.STAR.1_B12_Hypo_S78_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.000000 | 0.0 | 0.0 | 0.000000 | 1.584963 | 0.000000 | 0.0 | 4.807355 | ... | 0.000000 | 0.0 | 0.000000 | 9.002815 | 7.189825 | 0.000000 | 7.754888 | 0.000000 | 1.584963 | 2.807355 |
| "output.STAR.1_B1_Norm_S49_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 0.0 | 4.392317 | ... | 0.000000 | 0.0 | 0.000000 | 5.930737 | 3.906891 | 1.584963 | 7.467606 | 0.000000 | 0.000000 | 1.000000 |
| "output.STAR.1_B2_Norm_S50_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 1.000000 | 0.000000 | 0.000000 | 0.0 | 5.129283 | ... | 0.000000 | 0.0 | 0.000000 | 7.022368 | 4.321928 | 1.584963 | 8.219169 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.1_B3_Norm_S51_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 4.459432 | ... | 0.000000 | 0.0 | 1.584963 | 8.154818 | 5.459432 | 1.584963 | 9.197217 | 0.000000 | 0.000000 | 1.584963 |
| "output.STAR.1_B4_Norm_S52_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 5.584963 | ... | 0.000000 | 0.0 | 0.000000 | 7.693487 | 4.247928 | 0.000000 | 8.939579 | 0.000000 | 1.584963 | 0.000000 |
| "output.STAR.1_B5_Norm_S53_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 2.807355 | 0.0 | 0.0 | 3.807355 | 2.807355 | 1.000000 | 0.0 | 8.535275 | ... | 0.000000 | 0.0 | 1.584963 | 10.179909 | 6.599913 | 0.000000 | 11.897845 | 2.584963 | 2.321928 | 0.000000 |
20 rows × 22905 columns
corr_features_mcf7 = df_mcf7_allVars_log2_noDup.iloc[0:20].T.corr()
sns.heatmap(corr_features_mcf7,cmap='coolwarm', center=0)
<AxesSubplot:>
Looking at just the first 20 features, the correlation matrix shows red areas indicating high positive correlations, along with some weaker negative correlations.
Highly correlated features can be problematic for some machine learning algorithms, a problem known as multicollinearity. A common remedy is to keep only one feature out of each highly correlated pair and leave the other out of the model.
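The drop-one-of-each-pair idea can be sketched as a small helper. This is an illustrative sketch, not the notebook's code; the function name `drop_highly_correlated` and the 0.9 threshold are assumptions for the example.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one feature of each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy example: 'b' is a perfect multiple of 'a', so it is dropped
toy = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [4, 1, 3, 2]})
print(drop_highly_correlated(toy, threshold=0.9).columns.tolist())  # ['a', 'c']
```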
---------------------- Memory Cleaning Start --------------------------------
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
print(alldfs) # df1, df2
['_10', '_18', '_3', '_30', '_37', '_39', '_46', '_47', '_49', '_53', '_54', '_57', '_59', '_60', '_66', '_7', '__', 'c', 'c_dupl', 'corr_features_mcf7', 'df_corr_hypo_cells', 'df_corr_nohypo_cells', 'df_highCorr_info', 'df_info_sparsity_th50', 'df_info_sparsity_th50_feature', 'df_info_sparsity_th95', 'df_info_sparsity_th95_feature', 'df_lowCorr_info', 'df_mcf7_50vars_log2', 'df_mcf7_allVars_log2', 'df_mcf7_allVars_log2_noDup', 'df_mcf7_allVars_log2_norm', 'df_mcf7_allVars_log2_small', 'duplicate_rows_df_mcf7_allVars_log2', 'duplicate_rows_df_mcf7_allVars_log2_t', 'mcf7_smarts_filtered', 'mcf7_smarts_filtered_normalized', 'mcf7_smarts_metadata', 'mcf7_smarts_unfiltered', 'mcf7_smarts_unfiltered_noOut']
for elem in alldfs:
    exec('del ' + elem)
import gc
gc.collect()
239
#mcf7_smarts_unfiltered # control if deleted
---------------------- Memory Cleaning End --------------------------------
HCC1806_smarts_metadata = pd.read_csv("SmartSeq/HCC1806_SmartS_MetaData.tsv",delimiter="\t",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_smarts_metadata))
print("First column: ", HCC1806_smarts_metadata.iloc[ : , 0])
Dataframe dimensions: (243, 8)
First column: Filename
output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam HCC1806
...
output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam HCC1806
output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam HCC1806
Name: Cell Line, Length: 243, dtype: object
HCC1806_smarts_metadata
| Cell Line | PCR Plate | Pos | Condition | Hours | Cell name | PreprocessingTag | ProcessingComments | |
|---|---|---|---|---|---|---|---|---|
| Filename | ||||||||
| output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam | HCC1806 | 1 | A10 | Normo | 24 | S123 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam | HCC1806 | 1 | A12 | Normo | 24 | S26 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam | HCC1806 | 1 | A1 | Hypo | 24 | S97 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam | HCC1806 | 1 | A2 | Hypo | 24 | S104 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam | HCC1806 | 1 | A3 | Hypo | 24 | S4 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam | HCC1806 | 4 | H10 | Normo | 24 | S210 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam | HCC1806 | 4 | H11 | Normo | 24 | S214 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam | HCC1806 | 4 | H2 | Hypo | 24 | S199 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam | HCC1806 | 4 | H7 | Normo | 24 | S205 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
| output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam | HCC1806 | 4 | H9 | Normo | 24 | S236 | Aligned.sortedByCoord.out.bam | STAR,FeatureCounts |
243 rows × 8 columns
HCC1806_smarts_metadata.shape
(243, 8)
HCC1806_smarts_metadata['Cell name'].nunique()
243
We now proceed with the second data set. Its structure is exactly the same. As with mcf7_smarts_metadata, the indices of the data frame HCC1806_smarts_metadata are the filenames of the aligned-read (BAM) files for each cell studied. Like mcf7_smarts_metadata, this dataframe contains 8 columns:
Each filename combines information from several of the columns. For example, the first row has the filename output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam, which refers to cell S123 at position A10, grown under Normoxia, whose reads were aligned and sorted by coordinate.
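Since the condition label is embedded between underscores in every filename, it can be recovered with a short regular expression. This is a sketch, not the notebook's code; the helper name `condition_from_filename` is an assumption, and the pattern assumes the HCC1806 labels are always spelled out as Hypoxia/Normoxia as in the rows above.

```python
import re

def condition_from_filename(fname):
    # Assumes the label sits between underscores, e.g. "_Normoxia_"
    return re.search(r'_(Hypoxia|Normoxia)_', fname).group(1)

print(condition_from_filename(
    "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam"))  # Normoxia
```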
HCC1806_smarts_unfiltered = pd.read_csv("SmartSeq/HCC1806_SmartS_Unfiltered_Data.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_smarts_unfiltered))
print("First column: ", HCC1806_smarts_unfiltered.iloc[ : , 0])
Dataframe dimensions: (23396, 243)
First column: "WASH7P" 0
"CICP27" 0
"DDX11L17" 0
"WASH9P" 0
"OR4F29" 2
...
"MT-TE" 22
"MT-CYB" 4208
"MT-TT" 26
"MT-TP" 66
"MAFIP" 0
Name: "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam", Length: 23396, dtype: int64
HCC1806_smarts_unfiltered
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "WASH7P" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "CICP27" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "DDX11L17" | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| "WASH9P" | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| "OR4F29" | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| "MT-TE" | 22 | 43 | 0 | 0 | 0 | 3 | 47 | 4 | 2 | 8 | ... | 24 | 15 | 15 | 4 | 4 | 26 | 1 | 4 | 4 | 20 |
| "MT-CYB" | 4208 | 6491 | 25 | 4819 | 310 | 695 | 2885 | 1552 | 366 | 1829 | ... | 1119 | 1429 | 808 | 999 | 916 | 3719 | 984 | 2256 | 981 | 2039 |
| "MT-TT" | 26 | 62 | 0 | 11 | 4 | 0 | 41 | 9 | 2 | 8 | ... | 48 | 31 | 3 | 8 | 5 | 42 | 1 | 15 | 6 | 34 |
| "MT-TP" | 66 | 71 | 1 | 3 | 9 | 14 | 91 | 22 | 3 | 30 | ... | 119 | 52 | 11 | 22 | 15 | 48 | 18 | 36 | 8 | 79 |
| "MAFIP" | 0 | 4 | 0 | 7 | 0 | 9 | 0 | 4 | 2 | 0 | ... | 2 | 0 | 2 | 1 | 1 | 3 | 0 | 2 | 1 | 5 |
23396 rows × 243 columns
Each of the 243 columns of the dataframe HCC1806_smarts_unfiltered corresponds to a row of the dataframe HCC1806_smarts_metadata (243 rows). Thus, for each file (i.e. each sequenced cell) we know the expression level of every gene.
Each index of this unfiltered dataframe is a gene name (WASH7P, MT-TT, etc.), an identifier known as a gene symbol. Gene symbols are acronyms and are not guaranteed to be unique. Below we verify that the dataset contains only unique gene symbols:
HCC1806_smarts_unfiltered.reset_index()['index'].nunique() == HCC1806_smarts_unfiltered.shape[0]
True
The unfiltered dataframe contains only numeric information:
set(list(HCC1806_smarts_unfiltered.dtypes))
{dtype('int64')}
Do we have any missing data? Rather than inspecting null values cell by cell or column by column, we look at the grand total of missing values across the whole dataframe:
HCC1806_smarts_unfiltered.isnull().sum().sum()
0
There are no missing values in our dataframe, therefore there is no need for imputation.
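For completeness, here is a hypothetical illustration (not part of the notebook) of what imputation could look like had the counts contained missing values; for sparse count data, filling with 0 is a natural choice. The toy frame and column names `g1`/`g2` are invented for the example.

```python
import numpy as np
import pandas as pd

# Invented toy data with two NaNs standing in for missing counts
toy = pd.DataFrame({'g1': [3.0, np.nan, 1.0], 'g2': [np.nan, 0.0, 5.0]})
filled = toy.fillna(0)  # impute missing counts with 0
print(filled.isnull().sum().sum())  # 0
```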
We can look at the descriptive statistics of our dataframe:
HCC1806_smarts_unfiltered.describe(percentiles=[.05, .25, .5, .75, .95])
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | ... | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 |
| mean | 99.565695 | 207.678278 | 9.694734 | 150.689007 | 35.700504 | 47.088434 | 152.799453 | 135.869422 | 38.363908 | 45.512139 | ... | 76.361771 | 105.566593 | 54.026116 | 29.763806 | 28.905411 | 104.740725 | 35.181569 | 108.197940 | 37.279962 | 76.303855 |
| std | 529.532443 | 981.107905 | 65.546050 | 976.936548 | 205.885369 | 545.367706 | 864.974182 | 870.729740 | 265.062493 | 366.704721 | ... | 346.659348 | 536.881574 | 344.068304 | 186.721266 | 135.474736 | 444.773045 | 170.872090 | 589.082268 | 181.398951 | 369.090274 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 5% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 75% | 51.000000 | 125.000000 | 5.000000 | 40.000000 | 22.000000 | 17.000000 | 81.000000 | 76.000000 | 22.000000 | 18.000000 | ... | 56.000000 | 67.000000 | 29.000000 | 18.000000 | 19.000000 | 76.000000 | 24.000000 | 68.000000 | 22.000000 | 44.000000 |
| 95% | 425.000000 | 917.000000 | 39.000000 | 700.250000 | 153.000000 | 171.000000 | 644.000000 | 564.000000 | 169.000000 | 180.000000 | ... | 341.000000 | 458.500000 | 217.000000 | 118.250000 | 130.000000 | 458.000000 | 155.000000 | 469.000000 | 164.250000 | 342.250000 |
| max | 35477.000000 | 69068.000000 | 6351.000000 | 70206.000000 | 17326.000000 | 47442.000000 | 43081.000000 | 62813.000000 | 30240.000000 | 35450.000000 | ... | 19629.000000 | 30987.000000 | 21894.000000 | 13457.000000 | 11488.000000 | 33462.000000 | 15403.000000 | 34478.000000 | 10921.000000 | 28532.000000 |
10 rows × 243 columns
A quick look at the distributions of the features shows that many variables are highly right-skewed, being dominated by 0 values. The features are not standardized: they do not have zero mean and unit variance, and the standard deviation is very large compared to the mean. Their distributions have elongated right tails.
Let's choose 10 random variables to visualize their non-normal distributions:
HCC1806_smarts_unfiltered.shape[1]
243
np.random.seed(0)
range_max = HCC1806_smarts_unfiltered.shape[1]-1
random_variable_indices = [randint(0,range_max) for i in range(0,10)]
print(random_variable_indices)
for i in random_variable_indices:
    sns.displot(
        HCC1806_smarts_unfiltered,
        x=HCC1806_smarts_unfiltered.columns.tolist()[i],
        kind="kde"
    )
[47, 23, 226, 153, 86, 168, 174, 27, 112, 37]
We continue the exploratory data analysis by investigating outliers. We use the IQR rule to detect them: anything outside the range [Q1 − 1.5·IQR, Q3 + 1.5·IQR] would be dropped:
Q1 = HCC1806_smarts_unfiltered.quantile(0.25)
Q3 = HCC1806_smarts_unfiltered.quantile(0.75)
IQR = Q3 - Q1
print(IQR)
"output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" 51.0
"output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" 125.0
"output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" 5.0
"output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" 40.0
"output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" 22.0
...
"output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" 76.0
"output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" 24.0
"output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" 68.0
"output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" 22.0
"output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" 44.0
Length: 243, dtype: float64
HCC1806_smarts_unfiltered_noOut = HCC1806_smarts_unfiltered[~((HCC1806_smarts_unfiltered < (Q1 - 1.5 * IQR)) |(HCC1806_smarts_unfiltered > (Q3 + 1.5 * IQR))).any(axis=1)]
print(HCC1806_smarts_unfiltered_noOut.shape)
(10815, 243)
HCC1806_smarts_unfiltered.shape
(23396, 243)
100*(23396-10815)/23396
53.774149427252524
Using the interquartile range method to remove outliers would discard 54% of our dataset, which is not a desired outcome. As observed above, many features are mostly filled with 0s.
We can quantify sparsity this way: if at least X% of the observations of a variable are 0, it is not a very informative feature. If a dataframe consists mostly of sparse features, we can call it a sparse structure.
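The cells below reuse `variable_sparsity`, defined earlier in the notebook for the MCF7 data. A minimal sketch consistent with how it is used here (an assumption, not the notebook's exact code): flag a column as sparse (1) when its fraction of zero entries meets the threshold.

```python
import pandas as pd

def variable_sparsity(x, threshold):
    # Sketch: 1 if the share of zeros in x reaches the threshold, else 0
    return int((x == 0).mean() >= threshold)

s = pd.Series([0] * 9 + [7])  # 90% zeros
print(variable_sparsity(s, 0.90), variable_sparsity(s, 0.95))  # 1 0
```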
HCC1806_smarts_unfiltered_info_sparsity_th95 = (
pd.DataFrame(HCC1806_smarts_unfiltered.apply(lambda x: variable_sparsity(x, 0.95), axis=0))
.reset_index()
.rename(columns={'index':'variable', 0:'flag_sparsity'})
)
HCC1806_smarts_unfiltered_info_sparsity_th90 = (
pd.DataFrame(HCC1806_smarts_unfiltered.apply(lambda x: variable_sparsity(x, 0.90), axis=0))
.reset_index()
.rename(columns={'index':'variable', 0:'flag_sparsity'})
)
HCC1806_smarts_unfiltered_info_sparsity_th90
| variable | flag_sparsity | |
|---|---|---|
| 0 | "output.STAR.PCRPlate1A10_Normoxia_S123_Aligne... | 0 |
| 1 | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned... | 0 |
| 2 | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.s... | 0 |
| 3 | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.... | 0 |
| 4 | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.so... | 0 |
| ... | ... | ... |
| 238 | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligne... | 0 |
| 239 | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligne... | 0 |
| 240 | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.... | 0 |
| 241 | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned... | 0 |
| 242 | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned... | 0 |
243 rows × 2 columns
len(HCC1806_smarts_unfiltered_info_sparsity_th95[HCC1806_smarts_unfiltered_info_sparsity_th95.flag_sparsity ==1])
8
len(HCC1806_smarts_unfiltered_info_sparsity_th90[HCC1806_smarts_unfiltered_info_sparsity_th90.flag_sparsity ==1])
9
We defined two sparsity thresholds: 95% and 90%. Decreasing the threshold flags more sparse features (9 instead of 8). Sparsity can even be advantageous: it could help an algorithm separate the two classes easily (assuming one class has 0 values while the positive class has non-zero values).
As we noted earlier from the descriptive statistics and the density plots, the variables are highly concentrated around zero. Let's quantify the skewness and kurtosis:
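If we instead wanted to drop the sparse columns, the flags translate directly into a filter. This is a hedged toy sketch, not the notebook's code; the frame and column names `c1`/`c2` are invented, and a 50% threshold is used only to make the toy example decisive.

```python
import pandas as pd

# Toy data: c1 is 80% zeros, c2 only 20%
toy = pd.DataFrame({'c1': [0, 0, 0, 0, 5],
                    'c2': [1, 2, 0, 3, 4]})
zero_frac = (toy == 0).mean()          # fraction of zeros per column
kept = toy.loc[:, zero_frac < 0.5]     # keep columns below the threshold
print(kept.columns.tolist())  # ['c2']
```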
colN2 = HCC1806_smarts_unfiltered.shape[1]
list2_skew_cells = []
for i in range(colN2):
    v_df2 = HCC1806_smarts_unfiltered[HCC1806_smarts_unfiltered.columns.tolist()[i]]
    list2_skew_cells += [skew(v_df2)]
    # df_skew_cells += [df[cnames[i]].skew()]
list2_skew_cells
sns.histplot(list2_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - original df')
Text(0.5, 0, 'Skewness of single cells expression profiles - original df')
list2_skew_cells
[29.060170892188808, 26.375803364082593, 50.42219295095526, 39.962261084335715, 40.831761136535434, 63.23072175896631, 26.321572538057914, 38.16336782098296, 71.1522146807669, 56.09724027024084, 48.99814039787873, 18.464775901153637, 47.15195689494234, 120.05593935544216, 33.99427128862153, 30.088231341030692, 22.286514610762975, 32.3921796727025, 38.75514237474577, 28.75570474961464, 31.127814981546333, 27.7243909375453, 29.148368072000252, 57.96588411901668, 36.200031752886005, 31.916505070774285, 28.849458624684896, 26.591278048758557, 31.005757883624007, 23.96877960582696, 65.75883126384987, 32.81420818258468, 37.862153037163516, 16.33822134075528, 37.031403931895355, 24.588248881976064, 26.540459898858327, 33.74323287097885, 24.719083908680908, 36.34765567523466, 43.81950268488599, 22.547359217928758, 38.00948983833193, 52.87075001506269, 28.831690939327846, 41.27143769895153, 23.900651442148476, 69.1193979984549, 23.274249122692947, 23.07242432362229, 40.34033497278627, 24.723445511027933, 26.873455662483146, 29.337049856035584, 31.912072664409816, 28.64589741988196, 50.24129659070101, 77.53277858759418, 65.53484935042019, 29.803644291762144, 39.14346327157722, 28.103165431196167, 21.97403875208396, 23.342788025826017, 21.91999136354932, 44.0162414914299, 29.589877783311298, 30.019134050434268, 34.402373617975485, 30.337795496360716, 25.257671762518697, 25.130526268399045, 35.636910823655676, 18.168920193334568, 29.441636757689498, 27.81509260006191, 30.7761521650912, 28.200991861204457, 38.66613347911451, 29.830665565809714, 27.065440067862685, 25.530246534485183, 38.44370636150284, 33.417010973338535, 30.4660810691281, 33.25984115567298, 27.177779435316534, 33.81060583225937, 27.974652976479362, 31.167339996972512, 32.624501793790266, 32.17333721782705, 28.52130761579621, 38.381991695085816, 37.179839252699644, 17.95628539559926, 106.16864455497748, 32.43894611020188, 29.22889671117162, 34.9013478299637, 32.47835943431753, 27.85829925648979, 
33.02859957615664, 24.302823532441295, 23.526620101607854, 71.74126364633078, 65.52162231236102, 27.404488380699217, 31.882690513499206, 28.085004519680524, 36.61336935080446, 35.83217057208621, 38.9185225592096, 20.921206382272974, 31.60571471272116, 24.765320346552123, 30.943708151044117, 30.371949444604944, 20.87066711240708, 26.362647732058193, 29.97360774400982, 25.718532701805877, 38.701202377467546, 31.39499825561205, 31.432376802756384, 35.666502067328224, 35.30811121444335, 35.28060445290007, 46.10008936920693, 39.56703113828843, 29.452000662171333, 33.218045985022826, 31.76244452996697, 35.43108014536665, 39.12045929404136, 48.196363031747424, 41.407048359684524, 48.50322615977618, 42.48003863011313, 33.02693620274058, 34.94735169693031, 40.39098500497335, 65.1541860616389, 32.68646861775401, 28.28799136930552, 60.36034043397423, 29.21803738454348, 32.029886867561316, 36.9119840466036, 54.78297371189433, 30.90818257812919, 30.649237271956974, 120.39253489071356, 148.44914796278678, 29.632690238934984, 35.7245567143204, 128.2880615421966, 145.90649163354826, 30.498447944672503, 99.48218299600211, 28.55287546087581, 35.082278337944004, 26.69288446034834, 29.69203304831698, 25.575624974156533, 36.537165193155545, 33.90169602362678, 35.58168852742591, 30.277301451880106, 24.21607411352338, 31.579381740957583, 27.709277575957184, 31.801054822778795, 19.519110441226243, 38.24756035980779, 34.06147324392356, 20.87815824409594, 27.153587024599315, 23.43252929944743, 18.332679244713184, 29.025196532835672, 33.49130366315466, 31.26944393521498, 27.965174801734214, 35.76330666395702, 37.14101862724903, 29.11744113775292, 21.42868459467514, 30.364217064272978, 35.70259634201239, 41.60531835082573, 27.999655826709567, 45.07059859415066, 32.1977288874276, 32.246409598389256, 30.95983902536602, 55.668970161086875, 44.63214374632007, 58.38195345078094, 33.73443718092895, 35.62966465252458, 23.76232736105519, 29.85515971259724, 30.487749932224318, 37.25973669659179, 
38.75961324239045, 34.635899335916214, 34.53375018328137, 52.479605184205354, 31.136469881324548, 39.70371628489202, 25.937382679798528, 36.73992074551071, 37.76751938358611, 30.037199485570444, 30.43131882287085, 36.97601427985443, 58.73422749775392, 28.818086109856793, 25.971692543878376, 30.272120263302824, 26.48562983760213, 22.482150430603767, 46.743069264907604, 30.315142635646616, 27.125473839675752, 28.779416462790884, 47.14033616840759, 38.15701664420607, 31.963926086810957, 35.66135441701407, 32.11040038950747, 21.62360195938871, 30.762657720006544, 27.442854794301834, 33.35219340136692, 38.450607087210656, 35.4764972179422, 28.59829885194116, 40.79084360054741, 30.662299649172528, 26.49706205105885, 33.780203031193466]
list2_kurt_cells = []
for i in range(colN2):
    v_df2 = HCC1806_smarts_unfiltered[HCC1806_smarts_unfiltered.columns.tolist()[i]]
    list2_kurt_cells += [kurtosis(v_df2)]
    # df_kurt_cells += [df[mcf7_smarts_unfiltered.columns.tolist()[i]].kurt()]
list2_kurt_cells
sns.histplot(list2_kurt_cells,bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - original df')
Text(0.5, 0, 'Kurtosis of single cells expression profiles - original df')
As the histograms suggest, the data is far from normally distributed, with many features having high skewness and kurtosis values.
This is problematic because non-normal data can violate the assumptions of some machine learning models, or simply make it harder for an algorithm to detect differences among the non-zero values.
One way to make the distribution less skewed is to apply a log transformation:
var21_log2_2 = np.log2(HCC1806_smarts_unfiltered[HCC1806_smarts_unfiltered.columns.tolist()[20]]+1)
sns.boxplot(x=var21_log2_2)
<AxesSubplot:xlabel='"output.STAR.PCRPlate1B9_Normoxia_S21_Aligned.sortedByCoord.out.bam"'>
The same feature before log transformation is heavily centered around 0:
sns.boxplot(x=HCC1806_smarts_unfiltered[HCC1806_smarts_unfiltered.columns.tolist()[20]]+1)
<AxesSubplot:xlabel='"output.STAR.PCRPlate1B9_Normoxia_S21_Aligned.sortedByCoord.out.bam"'>
var21_log2_2.describe().round(2)
count 23396.00 mean 2.88 std 3.51 min 0.00 25% 0.00 50% 0.00 75% 6.17 max 15.51 Name: "output.STAR.PCRPlate1B9_Normoxia_S21_Aligned.sortedByCoord.out.bam", dtype: float64
Now let's take only the first 50 columns (for speed) and calculate skewness and kurtosis after applying the log transformation to them:
df_HCC1806_50vars_log2 = (HCC1806_smarts_unfiltered.iloc[:,:50]+1).apply(np.log2)
print(df_HCC1806_50vars_log2.shape)
(23396, 50)
df_HCC1806_50vars_log2
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.PCRPlate1E7_Normoxia_S116_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1E8_Normoxia_S17_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1F12_Normoxia_S31_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1F4_Hypoxia_S106_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1F5_Hypoxia_S9_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1F7_Normoxia_S117_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1F8_Normoxia_S18_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1F9_Normoxia_S24_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G10_Normoxia_S126_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G11_Normoxia_S25_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "WASH7P" | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 |
| "CICP27" | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.584963 | 0.000000 | 0.000000 | 0.000000 |
| "DDX11L17" | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.807355 | 0.000000 |
| "WASH9P" | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.584963 | 0.000000 | 1.000000 | 0.000000 |
| "OR4F29" | 1.584963 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| "MT-TE" | 4.523562 | 5.459432 | 0.00000 | 0.000000 | 0.000000 | 2.000000 | 5.584963 | 2.321928 | 1.584963 | 3.169925 | ... | 3.700440 | 4.321928 | 4.247928 | 3.700440 | 4.169925 | 2.321928 | 4.392317 | 4.087463 | 4.584963 | 4.459432 |
| "MT-CYB" | 12.039262 | 12.664447 | 4.70044 | 12.234817 | 8.280771 | 9.442943 | 11.494856 | 10.600842 | 8.519636 | 10.837628 | ... | 10.239599 | 11.846274 | 11.282509 | 10.036174 | 11.700440 | 9.942515 | 11.382084 | 9.616549 | 12.385053 | 11.961811 |
| "MT-TT" | 4.754888 | 5.977280 | 0.00000 | 3.584963 | 2.321928 | 0.000000 | 5.392317 | 3.321928 | 1.584963 | 3.169925 | ... | 2.807355 | 4.807355 | 5.459432 | 2.000000 | 3.584963 | 1.584963 | 3.906891 | 4.857981 | 5.000000 | 4.906891 |
| "MT-TP" | 6.066089 | 6.169925 | 1.00000 | 2.000000 | 3.321928 | 3.906891 | 6.523562 | 4.523562 | 2.000000 | 4.954196 | ... | 3.459432 | 5.700440 | 5.906891 | 3.906891 | 4.754888 | 3.459432 | 4.954196 | 2.000000 | 5.930737 | 6.066089 |
| "MAFIP" | 0.000000 | 2.321928 | 0.00000 | 3.000000 | 0.000000 | 3.321928 | 0.000000 | 2.321928 | 1.584963 | 0.000000 | ... | 2.000000 | 1.584963 | 3.000000 | 1.584963 | 4.392317 | 0.000000 | 2.321928 | 4.247928 | 1.584963 | 2.000000 |
23396 rows × 50 columns
np.shape(df_HCC1806_50vars_log2)
plt.figure(figsize=(16,4))
plot=sns.violinplot(data=df_HCC1806_50vars_log2,palette="Set3",cut=0)
plt.setp(plot.get_xticklabels(), rotation=90)
[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
df_HCC1806_allVars_log2 = (HCC1806_smarts_unfiltered.iloc[:,:]+1).apply(np.log2) # log transformation of all variables
df1_log2_2_skew_cells = []
for i in range(df_HCC1806_allVars_log2.shape[1]):
    v_df = df_HCC1806_allVars_log2[df_HCC1806_allVars_log2.columns.tolist()[i]]
    df1_log2_2_skew_cells += [skew(v_df)]
df1_log2_2_skew_cells
sns.histplot(df1_log2_2_skew_cells,bins=100)
plt.xlabel('Skewness of single cells expression profiles - log2 df')
Text(0.5, 0, 'Skewness of single cells expression profiles - log2 df')
After the log transformation, most cells now have a skewness score around 0, as we expected.
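As a quick sanity check (on synthetic counts, not the real data), the log2(x + 1) transform compresses the long right tail of count-like data and pulls the skewness towards 0:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(111)
# Synthetic right-skewed counts, mimicking raw expression values
counts = rng.negative_binomial(n=1, p=0.05, size=10_000).astype(float)

raw_skew = skew(counts)
log_skew = skew(np.log2(counts + 1))  # same log2(x + 1) transform used above

print(f"skewness raw: {raw_skew:.2f}, after log2: {log_skew:.2f}")
```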
df1_log2_2_kurt_cells = []
for col in df_HCC1806_allVars_log2.columns:
    df1_log2_2_kurt_cells += [kurtosis(df_HCC1806_allVars_log2[col])]
df1_log2_2_kurt_cells
sns.histplot(df1_log2_2_kurt_cells,bins=100)
plt.xlabel('Kurtosis of single cells expression profiles - log2 df')
Text(0.5, 0, 'Kurtosis of single cells expression profiles - log2 df')
len(df1_log2_2_kurt_cells)
243
for i in random_variable_indices:
sns.displot(
df_HCC1806_allVars_log2,
x= df_HCC1806_allVars_log2.columns.tolist()[i],
kind="kde"
)
Comparing these density plots with the ones above (before the transformation), we see that the distributions have changed and become more bimodal.
df_HCC1806_allVars_log2_small = df_HCC1806_allVars_log2.iloc[:, 10:30] # select only part of the cells so the run time stays manageable
sns.displot(data=df_HCC1806_allVars_log2_small,palette="Set3",kind="kde", bw_adjust=2)
<seaborn.axisgrid.FacetGrid at 0x7fda5ea1e970>
df_HCC1806_allVars_log2_small.describe()
| "output.STAR.PCRPlate1A9_Normoxia_S20_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B11_Normoxia_S127_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B12_Normoxia_S27_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B1_Hypoxia_S98_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B2_Hypoxia_S1_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B3_Hypoxia_S5_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B4_Hypoxia_S105_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B5_Hypoxia_S109_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B6_Hypoxia_S12_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B7_Normoxia_S114_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1B9_Normoxia_S21_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C10_Normoxia_S124_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C11_Normoxia_S128_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C12_Normoxia_S28_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C1_Hypoxia_S99_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C5_Hypoxia_S110_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C6_Hypoxia_S13_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C7_Normoxia_S115_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C8_Normoxia_S120_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1C9_Normoxia_S22_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 | 23396.000000 |
| mean | 2.843058 | 3.048954 | 3.252325 | 0.007102 | 2.159249 | 1.943660 | 1.281605 | 1.934244 | 1.936625 | 3.027089 | 2.884122 | 2.746226 | 2.916634 | 2.203049 | 2.107904 | 2.720679 | 2.747089 | 2.899276 | 2.721376 | 3.141384 |
| std | 3.254221 | 3.758881 | 3.831877 | 0.124829 | 3.295504 | 2.852602 | 1.931464 | 2.822692 | 2.608714 | 3.681674 | 3.512653 | 3.426206 | 3.542763 | 2.895482 | 3.419832 | 3.724526 | 3.588938 | 3.429424 | 3.289676 | 3.771466 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 5.857981 | 6.794416 | 6.918863 | 0.000000 | 5.044394 | 4.459432 | 2.321928 | 4.392317 | 4.087463 | 6.658211 | 6.169925 | 5.954196 | 6.285402 | 4.807355 | 4.523562 | 6.658211 | 6.285402 | 6.108524 | 5.727920 | 6.870365 |
| max | 15.368745 | 15.108851 | 16.678545 | 7.139551 | 15.326991 | 13.895575 | 11.000000 | 13.889694 | 13.751021 | 15.366220 | 15.506246 | 15.081026 | 15.439247 | 14.738778 | 15.732167 | 15.659048 | 15.517700 | 14.784328 | 14.746304 | 15.439669 |
from sklearn.preprocessing import Normalizer
transformer = Normalizer().fit(df_HCC1806_allVars_log2)
df_HCC1806_allVars_log2_norm = pd.DataFrame(
transformer.transform(df_HCC1806_allVars_log2),
columns=df_HCC1806_allVars_log2.columns
)
df_HCC1806_allVars_log2_norm.describe().round(2)
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | ... | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 | 23396.00 |
| mean | 0.04 | 0.05 | 0.02 | 0.03 | 0.03 | 0.03 | 0.04 | 0.04 | 0.03 | 0.02 | ... | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.05 | 0.03 | 0.04 | 0.03 | 0.04 |
| std | 0.05 | 0.06 | 0.03 | 0.06 | 0.04 | 0.05 | 0.06 | 0.05 | 0.04 | 0.04 | ... | 0.05 | 0.05 | 0.05 | 0.03 | 0.04 | 0.06 | 0.04 | 0.06 | 0.04 | 0.05 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | 0.03 |
| 75% | 0.07 | 0.08 | 0.03 | 0.07 | 0.06 | 0.05 | 0.07 | 0.08 | 0.06 | 0.05 | ... | 0.07 | 0.07 | 0.06 | 0.05 | 0.05 | 0.08 | 0.06 | 0.07 | 0.06 | 0.07 |
| max | 0.93 | 0.93 | 0.71 | 0.96 | 0.98 | 0.97 | 0.97 | 0.79 | 0.86 | 0.96 | ... | 0.97 | 0.89 | 0.89 | 0.85 | 0.96 | 0.94 | 0.93 | 0.93 | 0.90 | 0.96 |
8 rows × 243 columns
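Note that sklearn's `Normalizer` works row-wise: each row is rescaled to unit L2 norm. Since genes are the rows of our frame, each gene's expression profile across cells is normalised. A minimal illustration on a toy matrix:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[3.0, 4.0],
              [1.0, 0.0]])
Xn = Normalizer(norm="l2").fit_transform(X)  # each ROW rescaled to unit L2 norm

print(Xn)                          # [[0.6 0.8] [1. 0.]]
print(np.linalg.norm(Xn, axis=1))  # every row norm is 1
```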
for i in random_variable_indices:
    sns.displot(
        df_HCC1806_allVars_log2_norm,
        x= df_HCC1806_allVars_log2_norm.columns.tolist()[i],
        kind="kde"
    )
We load the filtered and the filtered+normalised datasets to compare them, as requested:
HCC1806_smarts_filtered = pd.read_csv("SmartSeq/HCC1806_SmartS_Filtered_Data.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_smarts_filtered))
print("First column: ", HCC1806_smarts_filtered.iloc[ : , 0])
HCC1806_smarts_filtered_normalized = pd.read_csv("SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_smarts_filtered_normalized))
print("First column: ", HCC1806_smarts_filtered_normalized.iloc[ : , 0])
Dataframe dimensions: (19503, 227)
First column: "CICP27" 0
"DDX11L17" 0
"WASH9P" 0
"OR4F29" 2
"MTND1P23" 250
...
"MT-TE" 22
"MT-CYB" 4208
"MT-TT" 26
"MT-TP" 66
"MAFIP" 0
Name: "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam", Length: 19503, dtype: int64
Dataframe dimensions: (3000, 182)
First column: "DDIT4" 0
"ANGPTL4" 48
"CALML5" 0
"KRT14" 321
"CCNB1" 298
...
"LINC02693" 29
"OR8B9P" 0
"NEAT1" 29
"ZDHHC23" 0
"ODAD2" 0
Name: "output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam", Length: 3000, dtype: int64
for i in random_variable_indices:
if i<HCC1806_smarts_filtered.shape[1]:
sns.displot(
HCC1806_smarts_filtered ,
x= HCC1806_smarts_filtered.columns.tolist()[i],
kind="kde"
)
The variables in the filtered dataset look very similar to those in the unfiltered dataset before the log transformation.
for i in random_variable_indices:
if i<HCC1806_smarts_filtered_normalized.shape[1]:
sns.displot(
HCC1806_smarts_filtered_normalized,
x= HCC1806_smarts_filtered_normalized.columns.tolist()[i],
kind="kde"
)
colN2_filtered_normalized = HCC1806_smarts_filtered_normalized.shape[1]
colN2_filtered_normalized
list2_skew_cells_filtered_normalized = []
for col in HCC1806_smarts_filtered_normalized.columns:
    list2_skew_cells_filtered_normalized += [skew(HCC1806_smarts_filtered_normalized[col])]
list2_skew_cells_filtered_normalized
sns.histplot(list2_skew_cells_filtered_normalized,bins=100)
plt.xlabel('Skewness of single cells expression profiles - filtered+normalised df')
Text(0.5, 0, 'Skewness of single cells expression profiles - filtered+normalised df')
colN2_filtered = HCC1806_smarts_filtered.shape[1]
colN2_filtered
list2_skew_cells_filtered = []
for col in HCC1806_smarts_filtered.columns:
    list2_skew_cells_filtered += [skew(HCC1806_smarts_filtered[col])]
list2_skew_cells_filtered
sns.histplot(list2_skew_cells_filtered,bins=100)
plt.xlabel('Skewness of single cells expression profiles - filtered df')
Text(0.5, 0, 'Skewness of single cells expression profiles - filtered df')
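The per-cell skewness loops above can be replaced by a single vectorized call, since `scipy.stats.skew` accepts a 2-D array and computes one value per column by default (a sketch on random data):

```python
import numpy as np
import pandas as pd
from scipy.stats import skew

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.exponential(size=(500, 4)), columns=list("abcd"))

# Loop version, as in the cells above
loop_skews = [skew(df[col]) for col in df.columns]

# Equivalent vectorized call (axis=0 is the default: one value per column)
vec_skews = skew(df.values, axis=0)

print(np.allclose(loop_skews, vec_skews))  # True
```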
HCC1806_smarts_filtered[['"output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam"']].describe().round(2)
| "output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam" | |
|---|---|
| count | 19503.00 |
| mean | 162.00 |
| std | 839.37 |
| min | 0.00 |
| 25% | 0.00 |
| 50% | 6.00 |
| 75% | 114.00 |
| max | 58205.00 |
set(HCC1806_smarts_filtered_normalized.columns.tolist()).intersection(set(HCC1806_smarts_filtered.columns.tolist()))
{'"output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1G1_Hypoxia_S102_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1G2_Hypoxia_S2_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1G3_Hypoxia_S7_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1G4_Hypoxia_S107_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1G7_Normoxia_S118_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1G8_Normoxia_S19_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1G9_Normoxia_S121_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1H1_Hypoxia_S103_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1H2_Hypoxia_S3_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1H5_Hypoxia_S10_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1H6_Hypoxia_S16_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate1H9_Normoxia_S122_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2A10_Normoxia_S153_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2A1_Hypoxia_S129_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2A3_Hypoxia_S36_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2A4_Hypoxia_S138_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2A6_Hypoxia_S44_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2A8_Normoxia_S151_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2A9_Normoxia_S53_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2B11_Normoxia_S159_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2B1_Hypoxia_S130_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2B2_Hypoxia_S135_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2B3_Hypoxia_S37_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2B4_Hypoxia_S139_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2B6_Hypoxia_S45_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2B8_Normoxia_S152_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C10_Normoxia_S154_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C11_Normoxia_S160_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C12_Normoxia_S59_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C1_Hypoxia_S131_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C3_Hypoxia_S38_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C4_Hypoxia_S140_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C5_Hypoxia_S144_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C6_Hypoxia_S46_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C7_Normoxia_S147_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2C8_Normoxia_S49_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2D12_Normoxia_S60_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2D1_Hypoxia_S132_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2D2_Hypoxia_S136_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2D5_Hypoxia_S41_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2D6_Hypoxia_S47_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2D9_Normoxia_S54_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E10_Normoxia_S155_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E11_Normoxia_S57_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E3_Hypoxia_S39_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E4_Hypoxia_S141_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E5_Hypoxia_S42_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E6_Hypoxia_S48_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E7_Normoxia_S148_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E8_Normoxia_S50_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2E9_Normoxia_S55_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F10_Normoxia_S156_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F1_Hypoxia_S133_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F2_Hypoxia_S33_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F3_Hypoxia_S40_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F4_Hypoxia_S142_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F6_Hypoxia_S145_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F7_Normoxia_S149_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2F8_Normoxia_S51_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2G12_Normoxia_S63_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2G2_Hypoxia_S34_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2G4_Hypoxia_S143_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2G7_Normoxia_S150_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2G8_Normoxia_S52_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2G9_Normoxia_S56_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2H10_Normoxia_S158_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2H11_Normoxia_S58_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2H1_Hypoxia_S134_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2H3_Hypoxia_S137_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2H5_Hypoxia_S43_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate2H6_Hypoxia_S146_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A10_Normoxia_S186_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A11_Normoxia_S89_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A12_Normoxia_S94_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A2_Hypoxia_S166_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A3_Hypoxia_S69_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A5_Hypoxia_S75_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A6_Hypoxia_S177_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3A9_Normoxia_S83_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B11_Normoxia_S90_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B1_Hypoxia_S64_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B3_Hypoxia_S70_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B4_Hypoxia_S173_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B5_Hypoxia_S76_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B6_Hypoxia_S178_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B7_Normoxia_S183_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B8_Normoxia_S82_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3B9_Normoxia_S84_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C10_Normoxia_S187_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C11_Normoxia_S91_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C12_Normoxia_S95_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C2_Hypoxia_S167_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C3_Hypoxia_S71_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C5_Hypoxia_S77_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C6_Hypoxia_S179_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3C9_Normoxia_S85_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3D12_Normoxia_S96_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3D1_Hypoxia_S161_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3D4_Hypoxia_S174_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3D9_Normoxia_S86_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E10_Normoxia_S189_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E12_Normoxia_S217_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E1_Hypoxia_S162_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E2_Hypoxia_S65_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E3_Hypoxia_S169_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E4_Hypoxia_S175_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E5_Hypoxia_S79_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3E9_Normoxia_S87_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F10_Normoxia_S190_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F12_Normoxia_S218_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F1_Hypoxia_S163_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F2_Hypoxia_S66_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F3_Hypoxia_S170_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F4_Hypoxia_S176_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F6_Hypoxia_S180_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3F9_Normoxia_S88_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G10_Normoxia_S191_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G11_Normoxia_S93_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G12_Normoxia_S219_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G1_Hypoxia_S164_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G2_Hypoxia_S67_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G3_Hypoxia_S171_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G4_Hypoxia_S73_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G5_Hypoxia_S80_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G6_Hypoxia_S181_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G7_Normoxia_S184_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3G9_Normoxia_S185_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3H10_Normoxia_S192_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3H1_Hypoxia_S165_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3H2_Hypoxia_S68_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3H3_Hypoxia_S172_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3H4_Hypoxia_S74_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3H6_Hypoxia_S182_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate3H7_Normoxia_S81_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4A10_Normoxia_S237_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4A1_Hypoxia_S220_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4A3_Hypoxia_S200_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4A7_Normoxia_S201_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4A8_Normoxia_S206_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4A9_Normoxia_S234_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4B10_Normoxia_S238_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4B11_Normoxia_S211_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4B12_Normoxia_S215_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4B2_Hypoxia_S194_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4B3_Hypoxia_S225_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4B6_Hypoxia_S230_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4B8_Normoxia_S207_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C10_Normoxia_S239_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C11_Normoxia_S212_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C12_Normoxia_S216_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C1_Hypoxia_S222_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C2_Hypoxia_S195_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C3_Hypoxia_S226_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C6_Hypoxia_S231_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C7_Normoxia_S202_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4C8_Normoxia_S208_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4E11_Normoxia_S213_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4E12_Normoxia_S241_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4E1_Hypoxia_S223_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4E2_Hypoxia_S196_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4E3_Hypoxia_S227_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4E8_Normoxia_S233_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F10_Normoxia_S240_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F12_Normoxia_S242_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F1_Hypoxia_S224_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F2_Hypoxia_S197_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F4_Hypoxia_S228_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F5_Hypoxia_S229_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F7_Normoxia_S203_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4F9_Normoxia_S235_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4G10_Normoxia_S209_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam"',
'"output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam"'}
HCC1806_smarts_filtered_normalized[['"output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam"']].describe().round(2)
| "output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam" | |
|---|---|
| count | 3000.00 |
| mean | 149.35 |
| std | 1052.55 |
| min | 0.00 |
| 25% | 0.00 |
| 50% | 0.00 |
| 75% | 5.25 |
| max | 39148.00 |
As suggested, we next check for duplicate rows, i.e. genes with identical expression profiles across all cells.
duplicate_rows_df_HCC1806_allVars_log2 = df_HCC1806_allVars_log2[df_HCC1806_allVars_log2.duplicated(keep=False)]
print("Number of duplicate rows:", duplicate_rows_df_HCC1806_allVars_log2.shape)
print("Duplicate rows:", duplicate_rows_df_HCC1806_allVars_log2)
Number of duplicate rows: (89, 243)
Duplicate rows: "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" \
"MMP23A" 0.0
"LINC01647" 0.0
"LINC01361" 0.0
"ITGA10" 0.0
"RORC" 0.0
... ...
"ENPP7" 0.0
"DTNA" 0.0
"ALPK2" 0.0
"OR7G2" 0.0
"PLVAP" 0.0
(output truncated: the same column layout repeats for the remaining 242 cells)
[89 rows x 243 columns]
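As a reminder of the pandas semantics used here (illustrated on a hypothetical mini matrix): `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates` keeps only the first occurrence:

```python
import pandas as pd

# Hypothetical mini expression matrix: geneA and geneC are exact duplicates
toy = pd.DataFrame(
    {"cell1": [0, 5, 0], "cell2": [0, 3, 0]},
    index=["geneA", "geneB", "geneC"],
)

dups = toy[toy.duplicated(keep=False)]  # both copies are flagged
print(dups.index.tolist())              # ['geneA', 'geneC']

deduped = toy.drop_duplicates()         # first occurrence survives
print(deduped.index.tolist())           # ['geneA', 'geneB']
```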
To understand which genes convey the same information, we can check their correlations.
#print("names of duplicate rows: ",duplicate_rows_df.index)
duplicate_rows_df_HCC1806_allVars_log2_t = duplicate_rows_df_HCC1806_allVars_log2.T
duplicate_rows_df_HCC1806_allVars_log2_t
c_dupl = duplicate_rows_df_HCC1806_allVars_log2_t.corr()
c_dupl
| "MMP23A" | "LINC01647" | "LINC01361" | "ITGA10" | "RORC" | "GPA33" | "OR2M4" | "LINC01247" | "SNORD92" | "LINC01106" | ... | "MSX2P1" | "MIR548D2" | "MIR548AA2" | "KCNJ16" | "CD300A" | "ENPP7" | "DTNA" | "ALPK2" | "OR7G2" | "PLVAP" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "MMP23A" | 1.000000 | -0.008299 | -0.008299 | -0.008299 | -0.008299 | -0.008299 | -0.008299 | -0.008299 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | -0.008299 | -0.008299 | -0.008299 | -0.008299 | -0.007355 | -0.008299 | -0.008299 |
| "LINC01647" | -0.008299 | 1.000000 | 0.495851 | 0.495851 | -0.008299 | 0.495851 | -0.008299 | 0.495851 | -0.008299 | 0.306913 | ... | -0.008299 | -0.009833 | -0.009833 | 0.495851 | 0.495851 | 0.495851 | 0.495851 | 0.886297 | 0.495851 | 0.495851 |
| "LINC01361" | -0.008299 | 0.495851 | 1.000000 | 1.000000 | 0.495851 | 1.000000 | 0.495851 | 0.495851 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | 0.495851 | 1.000000 | 0.495851 | 1.000000 | 0.206954 | 0.495851 | 1.000000 |
| "ITGA10" | -0.008299 | 0.495851 | 1.000000 | 1.000000 | 0.495851 | 1.000000 | 0.495851 | 0.495851 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | 0.495851 | 1.000000 | 0.495851 | 1.000000 | 0.206954 | 0.495851 | 1.000000 |
| "RORC" | -0.008299 | -0.008299 | 0.495851 | 0.495851 | 1.000000 | 0.495851 | 1.000000 | -0.008299 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | -0.008299 | 0.495851 | -0.008299 | 0.495851 | -0.007355 | -0.008299 | 0.495851 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| "ENPP7" | -0.008299 | 0.495851 | 0.495851 | 0.495851 | -0.008299 | 0.495851 | -0.008299 | 0.495851 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | 0.495851 | 0.495851 | 1.000000 | 0.495851 | 0.206954 | 0.495851 | 0.495851 |
| "DTNA" | -0.008299 | 0.495851 | 1.000000 | 1.000000 | 0.495851 | 1.000000 | 0.495851 | 0.495851 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | 0.495851 | 1.000000 | 0.495851 | 1.000000 | 0.206954 | 0.495851 | 1.000000 |
| "ALPK2" | -0.007355 | 0.886297 | 0.206954 | 0.206954 | -0.007355 | 0.206954 | -0.007355 | 0.206954 | -0.007355 | 0.418713 | ... | -0.007355 | -0.008715 | -0.008715 | 0.206954 | 0.206954 | 0.206954 | 0.206954 | 1.000000 | 0.206954 | 0.206954 |
| "OR7G2" | -0.008299 | 0.495851 | 0.495851 | 0.495851 | -0.008299 | 0.495851 | -0.008299 | 0.495851 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | 0.495851 | 0.495851 | 0.495851 | 0.495851 | 0.206954 | 1.000000 | 0.495851 |
| "PLVAP" | -0.008299 | 0.495851 | 1.000000 | 1.000000 | 0.495851 | 1.000000 | 0.495851 | 0.495851 | -0.008299 | -0.011157 | ... | -0.008299 | -0.009833 | -0.009833 | 0.495851 | 1.000000 | 0.495851 | 1.000000 | 0.206954 | 0.495851 | 1.000000 |
89 rows × 89 columns
We create the dataset without duplicate rows:
df_HCC1806_allVars_log2_noDup = df_HCC1806_allVars_log2.drop_duplicates()
#df_noDup
100*(len(df_HCC1806_allVars_log2)- len(df_HCC1806_allVars_log2_noDup))/len(df_HCC1806_allVars_log2)
0.2308086852453411
We removed less than 1% of the dataset (about 0.23% of the rows).
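The duplicate-removal and percentage computation above can be sketched on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy expression matrix with one duplicated gene row (hypothetical values)
df = pd.DataFrame({"cellA": [1, 2, 2, 5], "cellB": [0, 3, 3, 7]},
                  index=["g1", "g2", "g2_dup", "g4"])

df_noDup = df.drop_duplicates()  # keeps the first of each set of identical rows
pct_removed = 100 * (len(df) - len(df_noDup)) / len(df)
print(pct_removed)  # 25.0
```

Note that `drop_duplicates` compares row values only, so two different gene names with identical expression vectors count as duplicates.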
We are investigating the correlations between the samples:
plt.figure(figsize=(10,5))
#df_small = df.iloc[:, :50]
#c= df_small.corr()
c2= df_HCC1806_allVars_log2_noDup.corr()
midpoint2 = (c2.values.max() - c2.values.min()) /2 + c2.values.min()
#sns.heatmap(c,cmap='coolwarm',annot=True, center=midpoint )
sns.heatmap(c2,cmap='coolwarm', center=0 )
print("Number of cells included: ", np.shape(c2))
print("Midpoint of the correlation range between cells: ", midpoint2)
print("Min. correlation of expression profiles between cells: ", c2.values.min())
Number of cells included: (243, 243) Midpoint of the correlation range between cells: 0.5015487484881286 Min. correlation of expression profiles between cells: 0.003097496976257147
We see that the correlation matrix of cells contains high values and is therefore mostly red. There are some white stripes that indicate the presence of cells that are not correlated with the other cells.
For each cell we calculate how many low-correlated cells there are. We define low correlation as correlation values in the range between −0.2 and +0.2:
df_lowCorr_info2 = c2[(c2 < 0.2) & (c2>-0.2)].count().reset_index().rename(columns={'index':'cell', 0:'n_lowCorr_cells'})
df_lowCorr_info2
| cell | n_lowCorr_cells | |
|---|---|---|
| 0 | "output.STAR.PCRPlate1A10_Normoxia_S123_Aligne... | 8 |
| 1 | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned... | 8 |
| 2 | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.s... | 8 |
| 3 | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.... | 8 |
| 4 | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.so... | 8 |
| ... | ... | ... |
| 238 | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligne... | 8 |
| 239 | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligne... | 8 |
| 240 | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.... | 8 |
| 241 | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned... | 8 |
| 242 | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned... | 8 |
243 rows × 2 columns
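The counting logic above can be illustrated on a toy correlation matrix (hypothetical values):

```python
import pandas as pd

# Toy 3x3 correlation matrix for cells a, b, c (hypothetical values)
c = pd.DataFrame([[1.00, 0.10, 0.90],
                  [0.10, 1.00, -0.15],
                  [0.90, -0.15, 1.00]],
                 index=list("abc"), columns=list("abc"))

# Mask everything outside the low-correlation band (-0.2, 0.2) to NaN,
# then count the surviving entries per column (one count per cell)
low = c[(c < 0.2) & (c > -0.2)].count()
print(low.tolist())  # [1, 2, 1]: b is weakly correlated with both a and c
```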
Let's define the 'uncorrelated cell group' as the group of cells that have low correlation with at least half of the other cells. As a first check, we list how many low-correlated partners each cell has at all:
df_lowCorr_info2[df_lowCorr_info2['n_lowCorr_cells']> 0]
| cell | n_lowCorr_cells | |
|---|---|---|
| 0 | "output.STAR.PCRPlate1A10_Normoxia_S123_Aligne... | 8 |
| 1 | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned... | 8 |
| 2 | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.s... | 8 |
| 3 | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.... | 8 |
| 4 | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.so... | 8 |
| ... | ... | ... |
| 238 | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligne... | 8 |
| 239 | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligne... | 8 |
| 240 | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.... | 8 |
| 241 | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned... | 8 |
| 242 | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned... | 8 |
243 rows × 2 columns
df_lowCorr_info2.groupby(['n_lowCorr_cells']).agg('count')
| cell | |
|---|---|
| n_lowCorr_cells | |
| 8 | 235 |
| 242 | 8 |
Now we keep only the cells whose low-correlation count exceeds half of the samples:
half_of_samples = len(df_lowCorr_info2)/2
df_lowCorr_info2[df_lowCorr_info2['n_lowCorr_cells']> half_of_samples]
| cell | n_lowCorr_cells | |
|---|---|---|
| 13 | "output.STAR.PCRPlate1B1_Hypoxia_S98_Aligned.s... | 242 |
| 96 | "output.STAR.PCRPlate2E12_Normoxia_S61_Aligned... | 242 |
| 105 | "output.STAR.PCRPlate2F12_Normoxia_S62_Aligned... | 242 |
| 152 | "output.STAR.PCRPlate3D10_Normoxia_S188_Aligne... | 242 |
| 153 | "output.STAR.PCRPlate3D11_Normoxia_S92_Aligned... | 242 |
| 156 | "output.STAR.PCRPlate3D2_Hypoxia_S168_Aligned.... | 242 |
| 157 | "output.STAR.PCRPlate3D3_Hypoxia_S72_Aligned.s... | 242 |
| 159 | "output.STAR.PCRPlate3D5_Hypoxia_S78_Aligned.s... | 242 |
print(len(df_lowCorr_info2[df_lowCorr_info2['n_lowCorr_cells']> half_of_samples]))
8
8 cells express very different gene profiles (i.e. their correlations with almost all other cells are between −0.2 and 0.2).
uncorrelated_cells2 = df_lowCorr_info2.cell.tolist()  # note: this lists all 243 cells, not only the 8 uncorrelated ones
df_HCC1806_allVars_log2_noDup[uncorrelated_cells2].describe(percentiles=[0.05,0.25,0.5,0.75,0.95]).round(2)
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | ... | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 | 23342.00 |
| mean | 2.67 | 3.23 | 1.37 | 2.26 | 2.11 | 1.83 | 2.64 | 2.89 | 2.03 | 1.88 | ... | 2.83 | 2.80 | 2.24 | 2.03 | 2.05 | 3.12 | 2.26 | 2.81 | 2.17 | 2.77 |
| std | 3.34 | 3.85 | 1.94 | 3.62 | 2.76 | 2.81 | 3.65 | 3.55 | 2.81 | 2.83 | ... | 3.25 | 3.44 | 2.97 | 2.61 | 2.65 | 3.41 | 2.76 | 3.45 | 2.78 | 3.17 |
| min | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 5% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 25% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 50% | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.58 | 0.00 | 0.00 | 0.00 | 1.00 |
| 75% | 5.73 | 6.99 | 2.58 | 5.36 | 4.52 | 4.17 | 6.38 | 6.27 | 4.52 | 4.25 | ... | 5.86 | 6.09 | 4.95 | 4.25 | 4.32 | 6.27 | 4.64 | 6.11 | 4.58 | 5.52 |
| 95% | 8.73 | 9.85 | 5.32 | 9.46 | 7.27 | 7.43 | 9.34 | 9.14 | 7.41 | 7.51 | ... | 8.42 | 8.85 | 7.77 | 6.91 | 7.03 | 8.85 | 7.29 | 8.88 | 7.38 | 8.43 |
| max | 15.11 | 16.08 | 12.63 | 16.10 | 14.08 | 15.53 | 15.39 | 15.94 | 14.88 | 15.11 | ... | 14.26 | 14.92 | 14.42 | 13.72 | 13.49 | 15.03 | 13.91 | 15.07 | 13.41 | 14.80 |
10 rows × 243 columns
Across these cells, at least half of the data points per cell are 0 (the median expression is 0), while the standard deviations are relatively high.
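The zero-dominance can also be quantified directly as a per-cell fraction of zeros; a minimal sketch on hypothetical counts:

```python
import pandas as pd

# Hypothetical counts for three cells; genes are rows, cells are columns
df = pd.DataFrame({"cell1": [0, 0, 0, 4],
                   "cell2": [1, 2, 3, 4],
                   "cell3": [0, 0, 5, 6]})

# (df == 0) is a boolean frame; its column-wise mean is the share of zero genes per cell
zero_fraction = (df == 0).mean()
print(zero_fraction.tolist())  # [0.75, 0.0, 0.5]
```

Cells whose zero fraction is ≥ 0.5 match the pattern observed in the summary table above.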
We can also look at the cells that are highly correlated with other cells in the same way. We define the high-correlation threshold as absolute values greater than 0.75 (i.e. correlations above 0.75 or below −0.75):
df_highCorr_info2 = c2[(c2 < -0.75) | (c2> 0.75)].count().reset_index().rename(columns={'index':'cell', 0:'n_highCorr_cells'})
print(len(df_highCorr_info2[df_highCorr_info2['n_highCorr_cells']> half_of_samples]))
197
100*(197/243)
81.06995884773663
81% of the cells are highly correlated with more than half of the other cells.
df_highCorr_info2
| cell | n_highCorr_cells | |
|---|---|---|
| 0 | "output.STAR.PCRPlate1A10_Normoxia_S123_Aligne... | 199 |
| 1 | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned... | 197 |
| 2 | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.s... | 175 |
| 3 | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.... | 1 |
| 4 | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.so... | 187 |
| ... | ... | ... |
| 238 | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligne... | 204 |
| 239 | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligne... | 193 |
| 240 | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.... | 195 |
| 241 | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned... | 208 |
| 242 | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned... | 215 |
243 rows × 2 columns
df_highCorr_info2[df_highCorr_info2['n_highCorr_cells']> half_of_samples]
| cell | n_highCorr_cells | |
|---|---|---|
| 0 | "output.STAR.PCRPlate1A10_Normoxia_S123_Aligne... | 199 |
| 1 | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned... | 197 |
| 2 | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.s... | 175 |
| 4 | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.so... | 187 |
| 7 | "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.s... | 197 |
| ... | ... | ... |
| 238 | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligne... | 204 |
| 239 | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligne... | 193 |
| 240 | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.... | 195 |
| 241 | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned... | 208 |
| 242 | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned... | 215 |
197 rows × 2 columns
The cells listed above are highly correlated with more than half of the other cells.
c2
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | 1.000000 | 0.802084 | 0.738143 | 0.664497 | 0.771012 | 0.642613 | 0.747360 | 0.780823 | 0.786115 | 0.754578 | ... | 0.802027 | 0.793574 | 0.762539 | 0.796645 | 0.804307 | 0.810706 | 0.784340 | 0.777817 | 0.811515 | 0.839984 |
| "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | 0.802084 | 1.000000 | 0.753342 | 0.664647 | 0.772079 | 0.655250 | 0.742407 | 0.798381 | 0.774553 | 0.725916 | ... | 0.803735 | 0.794973 | 0.760604 | 0.812706 | 0.810450 | 0.819417 | 0.790652 | 0.782426 | 0.807248 | 0.827735 |
| "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | 0.738143 | 0.753342 | 1.000000 | 0.684016 | 0.807379 | 0.673347 | 0.712961 | 0.798955 | 0.766923 | 0.680903 | ... | 0.765429 | 0.773779 | 0.774988 | 0.818070 | 0.810438 | 0.797074 | 0.804671 | 0.780673 | 0.805295 | 0.789405 |
| "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | 0.664497 | 0.664647 | 0.684016 | 1.000000 | 0.674532 | 0.604009 | 0.650589 | 0.677262 | 0.659459 | 0.606316 | ... | 0.654940 | 0.681581 | 0.689495 | 0.707229 | 0.678522 | 0.679771 | 0.670490 | 0.693045 | 0.693684 | 0.680108 |
| "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | 0.771012 | 0.772079 | 0.807379 | 0.674532 | 1.000000 | 0.667336 | 0.724906 | 0.796727 | 0.777153 | 0.704081 | ... | 0.788413 | 0.786846 | 0.774041 | 0.814148 | 0.813060 | 0.810543 | 0.793323 | 0.785342 | 0.810927 | 0.812078 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| "output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" | 0.810706 | 0.819417 | 0.797074 | 0.679771 | 0.810543 | 0.663354 | 0.745373 | 0.823011 | 0.794964 | 0.719803 | ... | 0.840134 | 0.816209 | 0.787772 | 0.838233 | 0.843397 | 1.000000 | 0.836118 | 0.819319 | 0.848202 | 0.853989 |
| "output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" | 0.784340 | 0.790652 | 0.804671 | 0.670490 | 0.793323 | 0.650784 | 0.719596 | 0.793647 | 0.778771 | 0.689864 | ... | 0.804996 | 0.788981 | 0.766460 | 0.816875 | 0.825891 | 0.836118 | 1.000000 | 0.807582 | 0.835916 | 0.820247 |
| "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" | 0.777817 | 0.782426 | 0.780673 | 0.693045 | 0.785342 | 0.661865 | 0.733644 | 0.789238 | 0.764122 | 0.690615 | ... | 0.790921 | 0.792243 | 0.780626 | 0.810612 | 0.801519 | 0.819319 | 0.807582 | 1.000000 | 0.822911 | 0.804450 |
| "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" | 0.811515 | 0.807248 | 0.805295 | 0.693684 | 0.810927 | 0.674750 | 0.749507 | 0.814138 | 0.807045 | 0.732905 | ... | 0.830455 | 0.815674 | 0.794243 | 0.838682 | 0.848034 | 0.848202 | 0.835916 | 0.822911 | 1.000000 | 0.844897 |
| "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" | 0.839984 | 0.827735 | 0.789405 | 0.680108 | 0.812078 | 0.671312 | 0.762097 | 0.816603 | 0.815918 | 0.759474 | ... | 0.841578 | 0.821398 | 0.795487 | 0.829508 | 0.847031 | 0.853989 | 0.820247 | 0.804450 | 0.844897 | 1.000000 |
243 rows × 243 columns
So far we have looked at the correlations between different samples (cells), using them to check whether there is a cluster of cells that differs from the rest, and whether there is a cluster of cells that are very similar to each other.
Now let's look at the correlations within the two condition groups, starting with the hypoxia cells:
hypo_cells2 = [elem for elem in df_HCC1806_allVars_log2_noDup.columns.tolist() if 'Hypo' in elem ]
df_corr_hypo_cells2 = c2[c2.index.isin(hypo_cells2)]
df_corr_hypo_cells2 = df_corr_hypo_cells2[hypo_cells2]
midpoint_hypo2 = (df_corr_hypo_cells2.values.max() - df_corr_hypo_cells2.values.min()) /2 + df_corr_hypo_cells2.values.min()
print("Number of cells included: ", np.shape(df_corr_hypo_cells2))
print("Midpoint of the correlation range between hypoxia cells: ", midpoint_hypo2)
Number of cells included: (126, 126) Midpoint of the correlation range between hypoxia cells: 0.5138416720932725
lower_matrix = df_corr_hypo_cells2.mask(np.triu(np.ones(df_corr_hypo_cells2.shape, dtype=np.bool_)))
print(np.nanmean(lower_matrix))
print(np.nanstd(lower_matrix))
0.7125203020299888 0.18110061760024465
Let's look at the correlations between the normoxia cells:
no_hypo_cells2 = [elem for elem in df_HCC1806_allVars_log2_noDup.columns.tolist() if 'Hypo' not in elem ]
df_corr_nohypo_cells2 = c2[c2.index.isin(no_hypo_cells2)]
df_corr_nohypo_cells2 = df_corr_nohypo_cells2[no_hypo_cells2]
midpoint_nohypo2 = (df_corr_nohypo_cells2.values.max() - df_corr_nohypo_cells2.values.min()) /2 + df_corr_nohypo_cells2.values.min()
print("Number of cells included: ", np.shape(df_corr_nohypo_cells2))
print("Midpoint of the correlation range between normoxia cells: ", midpoint_nohypo2)
#df_mcf7_allVars_log2_noDup.corr()
Number of cells included: (117, 117) Midpoint of the correlation range between normoxia cells: 0.5025598128723647
lower_matrix_nohypo = df_corr_nohypo_cells2.mask(np.triu(np.ones(df_corr_nohypo_cells2.shape, dtype=np.bool_)))
print(np.nanmean(lower_matrix_nohypo))
print(np.nanstd(lower_matrix_nohypo))
0.7391483781373077 0.2012335333797111
The average correlation within the two cell groups (≈0.71 for the low-oxygen cells and ≈0.74 for the high-oxygen cells) is similar.
That means high-oxygen cells are not noticeably more similar to each other than low-oxygen cells are to each other.
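The lower-triangle masking used above can be wrapped in a small helper; a sketch with hypothetical values:

```python
import numpy as np

def offdiag_mean_std(corr):
    """Mean and std of the strictly lower triangle of a square correlation matrix.

    Excludes the diagonal (all 1.0) and counts each symmetric pair only once.
    """
    corr = np.asarray(corr, dtype=float)
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)  # strictly below the diagonal
    vals = corr[mask]
    return vals.mean(), vals.std()

# Toy symmetric matrix with off-diagonal values 0.6, 0.8, 1.0 (hypothetical)
m = np.array([[1.0, 0.6, 0.8],
              [0.6, 1.0, 1.0],
              [0.8, 1.0, 1.0]])
mean, std = offdiag_mean_std(m)
print(round(mean, 2))  # 0.8
```

Using `k=-1` (rather than masking with the full upper triangle including the diagonal) makes explicit that the self-correlations of 1.0 do not inflate the group averages.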
We choose 5 random cells from the high-oxygen condition and then 5 random cells from the low-oxygen condition and look at the distributions of their correlations with the other cells:
len(df_corr_nohypo_cells2)
117
random.seed(1111)
random_vars = [randint(0, len(df_corr_nohypo_cells2) - 1) for i in range(0,5)]  # randint is inclusive at both ends
sns.histplot(df_corr_nohypo_cells2.iloc[:,random_vars],bins=100)
plt.ylabel('Frequency')
plt.xlabel('Correlation between cells expression profiles')
Text(0.5, 0, 'Correlation between cells expression profiles')
random_vars = [randint(0, len(df_corr_hypo_cells2) - 1) for i in range(0,5)]  # randint is inclusive at both ends
sns.histplot(df_corr_hypo_cells2.iloc[:,random_vars],bins=100)
plt.ylabel('Frequency')
plt.xlabel('Correlation between cells expression profiles')
Text(0.5, 0, 'Correlation between cells expression profiles')
For both the normoxia and hypoxia conditions we sampled 5 random cells; the histograms show that these cells have high correlations with most of the other cells.
We also check the correlations between the features (i.e. the expression levels of the different genes) as requested. Computing this for all features would take too long, so we use only a small subset (the first 20 features) for this exercise:
df_HCC1806_allVars_log2_noDup.iloc[:,0:20].T
| "WASH7P" | "CICP27" | "DDX11L17" | "WASH9P" | "OR4F29" | "MTND1P23" | "MTND2P28" | "MTCO1P12" | "MTCO2P12" | "MTATP8P1" | ... | "MT-TH" | "MT-TS2" | "MT-TL2" | "MT-ND5" | "MT-ND6" | "MT-TE" | "MT-CYB" | "MT-TT" | "MT-TP" | "MAFIP" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "output.STAR.PCRPlate1A10_Normoxia_S123_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 1.584963 | 7.971544 | 5.781360 | 10.765700 | 2.807355 | 1.0 | ... | 4.169925 | 2.584963 | 4.000000 | 11.911766 | 9.815383 | 4.523562 | 12.039262 | 4.754888 | 6.066089 | 0.000000 |
| "output.STAR.PCRPlate1A12_Normoxia_S26_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 8.731319 | 6.658211 | 11.192909 | 2.584963 | 1.0 | ... | 5.643856 | 4.906891 | 5.209453 | 12.864573 | 10.491853 | 5.459432 | 12.664447 | 5.977280 | 6.169925 | 2.321928 |
| "output.STAR.PCRPlate1A1_Hypoxia_S97_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 3.584963 | 0.000000 | 4.523562 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 6.554589 | 3.700440 | 0.000000 | 4.700440 | 0.000000 | 1.000000 | 0.000000 |
| "output.STAR.PCRPlate1A2_Hypoxia_S104_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 6.000000 | 4.087463 | 9.995767 | 2.000000 | 0.0 | ... | 5.459432 | 4.169925 | 3.169925 | 10.531381 | 7.876517 | 0.000000 | 12.234817 | 3.584963 | 2.000000 | 3.000000 |
| "output.STAR.PCRPlate1A3_Hypoxia_S4_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 4.807355 | 2.000000 | 8.247928 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 2.000000 | 8.247928 | 5.087463 | 0.000000 | 8.280771 | 2.321928 | 3.321928 | 0.000000 |
| "output.STAR.PCRPlate1A4_Hypoxia_S8_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.000000 | 0.0 | 0.000000 | 6.357552 | 1.000000 | 9.199672 | 0.000000 | 0.0 | ... | 1.000000 | 0.000000 | 0.000000 | 9.346514 | 6.321928 | 2.000000 | 9.442943 | 0.000000 | 3.906891 | 3.321928 |
| "output.STAR.PCRPlate1A5_Hypoxia_S108_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 1.000000 | 8.257388 | 6.149747 | 11.329236 | 1.000000 | 2.0 | ... | 4.857981 | 3.906891 | 5.000000 | 12.673751 | 10.324181 | 5.584963 | 11.494856 | 5.392317 | 6.523562 | 0.000000 |
| "output.STAR.PCRPlate1A6_Hypoxia_S11_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 6.375039 | 3.169925 | 9.972980 | 2.584963 | 0.0 | ... | 1.000000 | 1.000000 | 1.584963 | 10.312883 | 7.971544 | 2.321928 | 10.600842 | 3.321928 | 4.523562 | 2.321928 |
| "output.STAR.PCRPlate1A7_Normoxia_S113_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 1.0 | 0.000000 | 5.554589 | 2.584963 | 8.794416 | 0.000000 | 0.0 | ... | 1.584963 | 1.000000 | 2.321928 | 9.204571 | 6.044394 | 1.584963 | 8.519636 | 1.584963 | 2.000000 | 1.584963 |
| "output.STAR.PCRPlate1A8_Normoxia_S119_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 7.475733 | 3.459432 | 9.607330 | 1.584963 | 1.0 | ... | 3.700440 | 2.584963 | 2.000000 | 10.621136 | 8.375039 | 3.169925 | 10.837628 | 3.169925 | 4.954196 | 0.000000 |
| "output.STAR.PCRPlate1A9_Normoxia_S20_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.000000 | 0.0 | 0.000000 | 6.845490 | 5.209453 | 9.724514 | 0.000000 | 1.0 | ... | 3.459432 | 0.000000 | 4.087463 | 11.512247 | 9.296916 | 4.169925 | 11.726218 | 3.321928 | 5.247928 | 0.000000 |
| "output.STAR.PCRPlate1B11_Normoxia_S127_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.000000 | 0.0 | 0.000000 | 7.531381 | 6.189825 | 10.918118 | 1.584963 | 1.0 | ... | 4.459432 | 3.906891 | 4.459432 | 12.055621 | 9.398744 | 4.169925 | 12.795228 | 4.700440 | 6.475733 | 2.000000 |
| "output.STAR.PCRPlate1B12_Normoxia_S27_Aligned.sortedByCoord.out.bam" | 1.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 7.523562 | 3.700440 | 10.071462 | 0.000000 | 0.0 | ... | 3.169925 | 3.459432 | 3.906891 | 11.770664 | 9.002815 | 4.000000 | 11.219774 | 3.321928 | 5.754888 | 3.000000 |
| "output.STAR.PCRPlate1B1_Hypoxia_S98_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.PCRPlate1B2_Hypoxia_S1_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 3.169925 | 0.0 | 1.584963 | 5.459432 | 1.000000 | 8.784635 | 0.000000 | 0.0 | ... | 0.000000 | 0.000000 | 0.000000 | 9.372865 | 6.845490 | 2.321928 | 9.802516 | 2.321928 | 3.459432 | 0.000000 |
| "output.STAR.PCRPlate1B3_Hypoxia_S5_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 1.584963 | 0.0 | 1.000000 | 5.832890 | 2.321928 | 8.179909 | 0.000000 | 0.0 | ... | 2.000000 | 1.000000 | 1.584963 | 9.350939 | 6.882643 | 2.000000 | 9.625709 | 0.000000 | 2.000000 | 0.000000 |
| "output.STAR.PCRPlate1B4_Hypoxia_S105_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 1.000000 | 2.807355 | 1.000000 | 5.700440 | 1.000000 | 0.0 | ... | 1.000000 | 0.000000 | 1.000000 | 7.845490 | 4.643856 | 0.000000 | 7.851749 | 0.000000 | 0.000000 | 0.000000 |
| "output.STAR.PCRPlate1B5_Hypoxia_S109_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 5.247928 | 1.000000 | 8.592457 | 0.000000 | 0.0 | ... | 2.000000 | 1.000000 | 2.000000 | 9.413628 | 6.714246 | 1.584963 | 9.348728 | 2.321928 | 2.807355 | 0.000000 |
| "output.STAR.PCRPlate1B6_Hypoxia_S12_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 5.044394 | 1.584963 | 8.154818 | 0.000000 | 0.0 | ... | 2.000000 | 1.584963 | 1.000000 | 9.469642 | 6.523562 | 1.000000 | 9.905387 | 1.584963 | 3.000000 | 0.000000 |
| "output.STAR.PCRPlate1B7_Normoxia_S114_Aligned.sortedByCoord.out.bam" | 0.0 | 0.0 | 0.000000 | 1.0 | 0.000000 | 6.228819 | 5.781360 | 10.087463 | 0.000000 | 1.0 | ... | 3.584963 | 2.807355 | 3.906891 | 11.429930 | 8.703904 | 3.459432 | 12.307201 | 3.169925 | 5.672425 | 4.087463 |
20 rows × 23342 columns
corr_features_HCC1806 = df_HCC1806_allVars_log2_noDup.iloc[0:20].T.corr()
sns.heatmap(corr_features_HCC1806,cmap='coolwarm', center=0)
<AxesSubplot:>
Looking at just the first 20 features, we notice red areas in the correlation matrix that indicate high positive correlations, as well as some (weaker) negative correlations. Highly correlated features can be problematic for some machine learning algorithms; this problem is known as multicollinearity. To mitigate it, only one feature from each highly correlated pair should be included in the model.
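One common way to act on this is to scan the upper triangle of the correlation matrix and drop one feature from each highly correlated pair; a minimal sketch on toy data (the 0.95 cutoff and the feature names are illustrative choices, not values from this dataset):

```python
import numpy as np
import pandas as pd

# Toy data: f2 is nearly a copy of f1, f3 is unrelated (hypothetical values)
df = pd.DataFrame({"f1": [1, 2, 3, 4, 5],
                   "f2": [1.1, 2.0, 3.1, 4.0, 5.1],
                   "f3": [5, 1, 4, 2, 3]})

corr = df.corr().abs()
# Keep only the strict upper triangle so each pair is considered exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # ['f2']
```

Because only the upper triangle is scanned, the first feature of each correlated pair is kept and the later one is dropped.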
-------------- Memory Cleaning Start --------------
import gc  # gc was not imported above; needed for gc.collect()

alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
print(alldfs)
for elem in alldfs:
    exec('del ' + elem)  # delete every DataFrame by name to free memory
gc.collect()
['HCC1806_smarts_filtered', 'HCC1806_smarts_filtered_normalized', 'HCC1806_smarts_metadata', 'HCC1806_smarts_unfiltered', 'HCC1806_smarts_unfiltered_info_sparsity_th90', 'HCC1806_smarts_unfiltered_info_sparsity_th95', 'HCC1806_smarts_unfiltered_noOut', '_100', '_107', '_109', '_116', '_118', '_120', '_124', '_125', '_126', '_127', '_130', '_133', '_134', '_135', '_143', '_73', '_77', '_81', '_90', '__', 'c2', 'c_dupl', 'corr_features_HCC1806', 'df_HCC1806_50vars_log2', 'df_HCC1806_allVars_log2', 'df_HCC1806_allVars_log2_noDup', 'df_HCC1806_allVars_log2_norm', 'df_HCC1806_allVars_log2_small', 'df_corr_hypo_cells2', 'df_corr_nohypo_cells2', 'df_highCorr_info2', 'df_lowCorr_info2', 'duplicate_rows_df_HCC1806_allVars_log2', 'duplicate_rows_df_HCC1806_allVars_log2_t', 'lower_matrix', 'lower_matrix_nohypo']
109
-------------- Memory Cleaning End --------------
We load the training set with 3000 features:
mcf7_train = pd.read_csv("SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_train)) # 3000 expressions of different genes, 250 cells
print("First column: ", mcf7_train.iloc[ : , 0])
Dataframe dimensions: (3000, 250)
First column: "CYP1B1" 343
"CYP1B1-AS1" 140
"CYP1A1" 0
"NDRG1" 0
"DDIT4" 386
...
"GRIK5" 0
"SLC25A27" 0
"DENND5A" 51
"CDK5R1" 0
"FAM13A-AS1" 0
Name: "output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam", Length: 3000, dtype: int64
mcf7_train.describe()
| "output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam" | "output.STAR.2_B4_Norm_S58_Aligned.sortedByCoord.out.bam" | "output.STAR.2_B5_Norm_S59_Aligned.sortedByCoord.out.bam" | "output.STAR.2_B6_Norm_S60_Aligned.sortedByCoord.out.bam" | "output.STAR.2_B7_Hypo_S79_Aligned.sortedByCoord.out.bam" | "output.STAR.2_B9_Hypo_S81_Aligned.sortedByCoord.out.bam" | "output.STAR.2_C10_Hypo_S130_Aligned.sortedByCoord.out.bam" | "output.STAR.2_C11_Hypo_S131_Aligned.sortedByCoord.out.bam" | "output.STAR.2_C1_Norm_S103_Aligned.sortedByCoord.out.bam" | "output.STAR.2_C2_Norm_S104_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.4_H10_Hypo_S382_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H14_Hypo_S383_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H1_Norm_S355_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H3_Norm_S357_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H4_Norm_S358_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H5_Norm_S359_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H6_Norm_S360_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H7_Hypo_S379_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H8_Hypo_S380_Aligned.sortedByCoord.out.bam" | "output.STAR.4_H9_Hypo_S381_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | ... | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 |
| mean | 74.140333 | 90.907000 | 99.089000 | 88.137000 | 110.395667 | 148.849000 | 126.422667 | 142.229667 | 91.781000 | 91.426333 | ... | 144.008333 | 133.846000 | 98.699333 | 84.070333 | 101.416333 | 96.636667 | 92.344333 | 154.387333 | 125.340000 | 132.017667 |
| std | 345.005307 | 409.560228 | 442.980702 | 425.804372 | 822.178446 | 1710.088769 | 1351.567001 | 1515.496440 | 388.660906 | 376.793214 | ... | 1349.125183 | 1242.320764 | 417.410827 | 406.100983 | 513.988262 | 499.224863 | 680.698856 | 1169.686762 | 1066.926126 | 1422.143351 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 24.000000 | 37.000000 | 33.000000 | 34.000000 | 38.250000 | 24.000000 | 13.000000 | 22.000000 | 37.000000 | 44.000000 | ... | 33.000000 | 38.000000 | 52.250000 | 25.000000 | 33.000000 | 44.000000 | 17.000000 | 19.000000 | 21.000000 | 20.250000 |
| max | 8222.000000 | 10167.000000 | 11446.000000 | 10312.000000 | 30586.000000 | 65037.000000 | 52680.000000 | 60789.000000 | 9394.000000 | 9077.000000 | ... | 56392.000000 | 50404.000000 | 11352.000000 | 8713.000000 | 17006.000000 | 16625.000000 | 29663.000000 | 34565.000000 | 34175.000000 | 57814.000000 |
8 rows × 250 columns
We want single cells to be our observations, and the gene expressions to be the features. So we transpose the dataset:
mcf7_train_T = mcf7_train.T
mcf7_train_T.describe()
| "CYP1B1" | "CYP1B1-AS1" | "CYP1A1" | "NDRG1" | "DDIT4" | "PFKFB3" | "HK2" | "AREG" | "MYBL2" | "ADM" | ... | "CD27-AS1" | "DNAI7" | "MAFG" | "LZTR1" | "BCO2" | "GRIK5" | "SLC25A27" | "DENND5A" | "CDK5R1" | "FAM13A-AS1" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | ... | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 | 250.000000 |
| mean | 5454.536000 | 2258.572000 | 1604.580000 | 606.380000 | 2487.000000 | 1495.920000 | 868.424000 | 308.984000 | 394.988000 | 183.096000 | ... | 22.036000 | 0.192000 | 50.884000 | 23.308000 | 0.192000 | 0.256000 | 0.160000 | 60.536000 | 2.860000 | 5.952000 |
| std | 8282.337795 | 3453.650882 | 5657.397449 | 766.718881 | 3422.213185 | 2109.376474 | 1837.399974 | 592.950034 | 564.259514 | 470.374582 | ... | 43.250493 | 2.000787 | 69.729761 | 36.415015 | 1.309195 | 2.001622 | 1.167842 | 75.647093 | 8.839056 | 21.649028 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 200.750000 | 85.000000 | 0.000000 | 1.000000 | 96.000000 | 71.000000 | 8.250000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 13.250000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 11.250000 | 0.000000 | 0.000000 |
| 50% | 795.000000 | 321.500000 | 0.000000 | 222.500000 | 1198.500000 | 369.500000 | 139.500000 | 18.000000 | 123.500000 | 0.000000 | ... | 8.000000 | 0.000000 | 37.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 38.500000 | 0.000000 | 0.000000 |
| 75% | 8180.500000 | 3348.500000 | 284.500000 | 1077.500000 | 3720.000000 | 2180.750000 | 1008.000000 | 371.000000 | 629.000000 | 74.500000 | ... | 30.000000 | 0.000000 | 63.000000 | 41.000000 | 0.000000 | 0.000000 | 0.000000 | 77.000000 | 0.000000 | 0.000000 |
| max | 44406.000000 | 17673.000000 | 58717.000000 | 3884.000000 | 21994.000000 | 12078.000000 | 16625.000000 | 3496.000000 | 3108.000000 | 3586.000000 | ... | 509.000000 | 28.000000 | 816.000000 | 196.000000 | 14.000000 | 26.000000 | 11.000000 | 639.000000 | 62.000000 | 288.000000 |
8 rows × 3000 columns
mcf7_train_T.head(1)
| "CYP1B1" | "CYP1B1-AS1" | "CYP1A1" | "NDRG1" | "DDIT4" | "PFKFB3" | "HK2" | "AREG" | "MYBL2" | "ADM" | ... | "CD27-AS1" | "DNAI7" | "MAFG" | "LZTR1" | "BCO2" | "GRIK5" | "SLC25A27" | "DENND5A" | "CDK5R1" | "FAM13A-AS1" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "output.STAR.2_B3_Norm_S57_Aligned.sortedByCoord.out.bam" | 343 | 140 | 0 | 0 | 386 | 75 | 0 | 0 | 476 | 0 | ... | 63 | 0 | 17 | 59 | 0 | 0 | 0 | 51 | 0 | 0 |
1 rows × 3000 columns
random_variable_indices = [randint(0, (mcf7_train_T.shape[1]-1)) for i in range(0,10)]
print(random_variable_indices)
for i in random_variable_indices:
    sns.displot(
        mcf7_train_T,
        x=mcf7_train_T.columns.tolist()[i],
        kind="kde"
    )
[2064, 767, 382, 2457, 1385, 2697, 2798, 436, 1796, 605]
I am loading the test set and transposing it:
mcf7_test = pd.read_csv("SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_test)) # 3000 expressions of different genes, 63 cells
print("First column: ", mcf7_test.iloc[ : , 0])
Dataframe dimensions: (3000, 63)
First column: "CYP1B1" 492
"CYP1B1-AS1" 253
"CYP1A1" 0
"NDRG1" 1157
"DDIT4" 6805
...
"GRIK5" 0
"SLC25A27" 0
"DENND5A" 285
"CDK5R1" 0
"FAM13A-AS1" 1
Name: "1", Length: 3000, dtype: int64
mcf7_test.head()
| "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "10" | ... | "54" | "55" | "56" | "57" | "58" | "59" | "60" | "61" | "62" | "63" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "CYP1B1" | 492 | 7199 | 12 | 373 | 31 | 245 | 258 | 20634 | 6804 | 570 | ... | 202 | 120 | 340 | 7919 | 2015 | 287 | 15220 | 21998 | 39 | 195 |
| "CYP1B1-AS1" | 253 | 3245 | 11 | 187 | 13 | 109 | 106 | 8769 | 2911 | 246 | ... | 89 | 57 | 151 | 3515 | 929 | 126 | 6316 | 8898 | 17 | 81 |
| "CYP1A1" | 0 | 7181 | 1 | 0 | 0 | 0 | 0 | 4813 | 72 | 1 | ... | 0 | 0 | 0 | 0 | 75 | 666 | 1991 | 21329 | 1 | 0 |
| "NDRG1" | 1157 | 1857 | 5 | 0 | 0 | 10 | 3 | 1196 | 1168 | 2 | ... | 9 | 0 | 0 | 503 | 237 | 3270 | 750 | 1498 | 29 | 6 |
| "DDIT4" | 6805 | 20731 | 147 | 43 | 0 | 25 | 646 | 11080 | 2988 | 9 | ... | 1 | 0 | 4 | 5323 | 16733 | 25776 | 12176 | 5144 | 20 | 93 |
5 rows × 63 columns
mcf7_test_T = mcf7_test.T
The first unsupervised learning technique we will use to find hidden patterns in the data is PCA.
PCA is a dimensionality reduction technique that projects the data into a new vector space along the directions of maximum variance.
We standardize each feature first, as is usually done before PCA:
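As a quick check that standardization is just the column-wise z-score, here is a minimal sketch on synthetic data (the shapes and distribution are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 4))

Z = StandardScaler().fit_transform(X)

# StandardScaler applies (x - mean) / std per column, with std using ddof=0
Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
print(np.allclose(Z, Z_manual))  # True
```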
standardizer = StandardScaler()
mcf7_train_T_S = standardizer.fit_transform(mcf7_train_T)
mcf7_train_T_S.shape
(250, 3000)
pd.DataFrame(mcf7_train_T_S, columns=mcf7_train_T.columns.tolist()).describe().round(2)
| "CYP1B1" | "CYP1B1-AS1" | "CYP1A1" | "NDRG1" | "DDIT4" | "PFKFB3" | "HK2" | "AREG" | "MYBL2" | "ADM" | ... | "CD27-AS1" | "DNAI7" | "MAFG" | "LZTR1" | "BCO2" | "GRIK5" | "SLC25A27" | "DENND5A" | "CDK5R1" | "FAM13A-AS1" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | ... | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 | 250.00 |
| mean | 0.00 | -0.00 | -0.00 | 0.00 | 0.00 | -0.00 | 0.00 | 0.00 | -0.00 | -0.00 | ... | -0.00 | -0.00 | -0.00 | 0.00 | -0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| std | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ... | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| min | -0.66 | -0.66 | -0.28 | -0.79 | -0.73 | -0.71 | -0.47 | -0.52 | -0.70 | -0.39 | ... | -0.51 | -0.10 | -0.73 | -0.64 | -0.15 | -0.13 | -0.14 | -0.80 | -0.32 | -0.28 |
| 25% | -0.64 | -0.63 | -0.28 | -0.79 | -0.70 | -0.68 | -0.47 | -0.52 | -0.70 | -0.39 | ... | -0.51 | -0.10 | -0.54 | -0.64 | -0.15 | -0.13 | -0.14 | -0.65 | -0.32 | -0.28 |
| 50% | -0.56 | -0.56 | -0.28 | -0.50 | -0.38 | -0.54 | -0.40 | -0.49 | -0.48 | -0.39 | ... | -0.33 | -0.10 | -0.20 | -0.61 | -0.15 | -0.13 | -0.14 | -0.29 | -0.32 | -0.28 |
| 75% | 0.33 | 0.32 | -0.23 | 0.62 | 0.36 | 0.33 | 0.08 | 0.10 | 0.42 | -0.23 | ... | 0.18 | -0.10 | 0.17 | 0.49 | -0.15 | -0.13 | -0.14 | 0.22 | -0.32 | -0.28 |
| max | 4.71 | 4.47 | 10.12 | 4.28 | 5.71 | 5.03 | 8.59 | 5.39 | 4.82 | 7.25 | ... | 11.28 | 13.93 | 10.99 | 4.75 | 10.57 | 12.89 | 9.30 | 7.66 | 6.70 | 13.05 |
8 rows × 3000 columns
Now that our data set has zero mean and unit variance, we can apply the PCA transformation.
We do not specify the number of components: according to the documentation, when the number of samples is smaller than the number of features, sklearn's PCA returns as many components as there are samples.
We will do analyses later on to understand the relationship between the information in the dataset and the optimal number of components to use:
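A quick sanity check of this behaviour on synthetic data (sizes chosen arbitrarily to mimic our "wide" cells × genes matrix) shows sklearn keeping min(n_samples, n_features) components when n_components is left unset:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic "wide" data: fewer samples (50) than features (200)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))

pca = PCA()                 # n_components left unset
scores = pca.fit_transform(X)

# sklearn keeps min(n_samples, n_features) components
print(pca.n_components_)    # 50
print(scores.shape)         # (50, 50)
```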
pca_mcf7 = PCA(random_state=101)
mcf7_train_T_S_PCA = pca_mcf7.fit_transform(mcf7_train_T_S)
explained_variance_mcf7 = pca_mcf7.explained_variance_ratio_
cumulative_sum_variance = np.cumsum(explained_variance_mcf7)
We plot the cumulative variance of the first ten principal components:
plt.plot(np.arange(1,11,1), cumulative_sum_variance[0:10])
plt.ylabel('cumulative variance')
Text(0, 0.5, 'cumulative variance')
The first ten principal components explain 24% of the variance of the dataset. The first two principal components alone reflect 10% of the total variance; including the third component adds only about 3% more information.
plt.plot(np.arange(1,101,1), cumulative_sum_variance[0:100])
plt.ylabel('cumulative variance')
Text(0, 0.5, 'cumulative variance')
We need about 100 components to account for 70% of the variance in the data. Given that we start with 3000 features, working with only a thirtieth of the original dimensions is a substantial compression.
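The component count for a given variance target can be read off the cumulative curve programmatically; a sketch with np.searchsorted on synthetic data (the matrix and the 70% threshold are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative variance reaches 70%
n_70 = int(np.searchsorted(cumvar, 0.70)) + 1
print(n_70, cumvar[n_70 - 1])
```

An equivalent shortcut in sklearn is to pass a float, e.g. `PCA(n_components=0.70)`, which keeps just enough components to explain that fraction of variance.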
For this exercise, we can plot hypoxia and normoxia cells using the first two principal components:
Y_train = mcf7_train_T.reset_index()['index'].apply(lambda x_str: 1 if 'Hypo' in x_str else 0)
plt.figure()
plot = plt.scatter(mcf7_train_T_S_PCA[:,0], mcf7_train_T_S_PCA[:,1], c=Y_train, cmap="bwr_r")
plt.legend(handles=plot.legend_elements()[0], labels=["Hypoxia","Normal"])
plt.show()
We apply the PCA transformer to the test set:
mcf7_test_T_S = standardizer.transform(mcf7_test_T)
mcf7_test_T_S_PCA = pca_mcf7.transform(mcf7_test_T_S)
plt.figure()
plt.scatter(mcf7_test_T_S_PCA[:,0], mcf7_test_T_S_PCA[:,1])
plt.show()
Since we do not have the hypoxia condition of the cells in the test set, we cannot colour the points by condition as we did with the training set. But we would expect the cells to the left of 0 and to the right of 0 to represent the two conditions.
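If that expectation holds, a crude pseudo-labelling of the test cells by the sign of the first principal component would recover the two groups. A sketch of the idea on synthetic PC1 scores (hypothetical stand-ins, since the true test labels are unknown):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for mcf7_test_T_S_PCA[:, 0]: two groups on either side of 0
pc1 = np.concatenate([rng.normal(-20, 5, 30), rng.normal(20, 5, 33)])

pseudo_labels = (pc1 > 0).astype(int)  # 1 = right of 0, 0 = left of 0
print(np.bincount(pseudo_labels))      # [30 33]
```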
Isomap is short for Isometric Mapping. It is a non-linear dimensionality reduction method that preserves local structure. It is in fact a combination of several algorithms: k-nearest neighbors (KNN), a shortest-path algorithm (Dijkstra's, for example), and Multidimensional Scaling (MDS). Isomap is distinguished from plain MDS by its use of geodesic distances, which preserves the manifold structure in the resulting embedding. The geodesic distance between two points is, informally, the length of the shortest path along the surface itself; Isomap approximates it by the shortest path through the neighbourhood graph.
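The three stages can be sketched by hand with sklearn and scipy building blocks. This is a rough illustration on a synthetic S-curve, not sklearn's exact internal implementation, and all parameter values here are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

X, _ = make_s_curve(n_samples=200, random_state=0)

# 1) k-nearest-neighbour graph on the raw points
knn = kneighbors_graph(X, n_neighbors=10, mode='distance')

# 2) geodesic distances = shortest paths through the graph (Dijkstra)
geo = shortest_path(knn, method='D', directed=False)

# 3) metric MDS embeds the geodesic distance matrix in 2D
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
embedding = mds.fit_transform(geo)
print(embedding.shape)  # (200, 2)
```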
isomap = Isomap()
mcf7_train_T_S_isomap = isomap.fit_transform(mcf7_train_T_S)
plt.figure()
plot = plt.scatter(mcf7_train_T_S_isomap[:,0], mcf7_train_T_S_isomap[:,1], c=Y_train, cmap="bwr_r")
plt.legend(handles=plot.legend_elements()[0], labels=["Hypoxia","Normal"])
plt.show()
mcf7_train_T_S_isomap.shape
(250, 2)
Isomap distinguishes the two cases. Even along the first dimension alone, a threshold around dim1 = -50 differentiates Hypoxia and Normal cells, with only a small error for the observations between -50 and 0.
As a third dimensionality reduction method, we will try the t-SNE algorithm. t-SNE works well on data with non-linear structure. It converts pairwise distances into probability distributions of similarities and minimizes the divergence between the distribution in the original space and the one in the low-dimensional embedding, keeping similar observations close and dissimilar ones apart.
tsne_mcf7 = TSNE(random_state=101)
mcf7_train_T_S_tsne = tsne_mcf7.fit_transform(mcf7_train_T_S)
plt.figure()
plot = plt.scatter(mcf7_train_T_S_tsne[:,0], mcf7_train_T_S_tsne[:,1], c=Y_train, cmap="bwr_r")
plt.legend(handles=plot.legend_elements()[0], labels=["Hypoxia","Normal"])
plt.show()
The TSNE implementation in sklearn does not have a transform method, so we cannot apply the fitted embedding to the test set to see whether its clusters are well separated.
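A common workaround, sketched here on synthetic stand-ins for the real matrices, is to run fit_transform once on the stacked train and test data and split the embedding back afterwards:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 50))   # stand-in for the training matrix
X_test = rng.normal(size=(20, 50))    # stand-in for the test matrix

# Embed train and test jointly, then split the result by row count
X_all = np.vstack([X_train, X_test])
emb = TSNE(n_components=2, perplexity=30, random_state=101).fit_transform(X_all)
emb_train, emb_test = emb[:80], emb[80:]
print(emb_train.shape, emb_test.shape)  # (80, 2) (20, 2)
```

Note that this leaks test-set structure into the embedding, so it is suitable only for visualisation, not for model selection.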
We can use our PCA- or t-SNE-transformed dataset to find clusters with the k-means algorithm. We use the transformed data to help the algorithm; the untransformed dataset could be used as well, and we tried that too.
The metric that k-means minimizes is called inertia: the sum of squared distances of samples to their closest cluster centroid.
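To make the definition concrete, a small sketch (on synthetic blobs, not our expression data) recomputes inertia by hand and compares it with the attribute sklearn exposes:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# inertia = sum over samples of the squared distance to the assigned centroid
diffs = X - km.cluster_centers_[km.labels_]
manual_inertia = float((diffs ** 2).sum())
print(np.isclose(manual_inertia, km.inertia_))  # True
```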
We choose 3 components in PCA to transform the data (there is little difference between 2 and 3, as we explained above):
pca_mcf7_final = PCA(n_components=3, random_state=101)
mcf7_train_T_S_PCA3 = pca_mcf7_final.fit_transform(mcf7_train_T_S)
pca_mcf7_final2 = PCA(n_components=2, random_state=101)
mcf7_train_T_S_PCA2 = pca_mcf7_final2.fit_transform(mcf7_train_T_S)
We calculate inertia for k-means with 1 to 10 clusters:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(mcf7_train_T_S_PCA3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
There is a break point when the number of clusters equals 2, so we decide to use 2 groups. We visualize the samples and their cluster centroids.
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(mcf7_train_T_S_PCA3)
plt.scatter(mcf7_train_T_S_PCA3[:,0], mcf7_train_T_S_PCA3[:,1],c=Y_train, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='black')
plt.show()
Performing k-means on the PCA-transformed dataset of the MCF7 cell line works well. We found two clusters corresponding to the Hypoxia and Normal conditions. Red and blue show the Hypoxia and Normal conditions (the ground truth of the dataset), and black marks the cluster centroids found, which match the truth.
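Since the true conditions are available for the training set, the visual match can also be quantified with the adjusted Rand index, which compares a clustering to reference labels up to label permutation. A sketch of the idea on synthetic blobs (the same call would apply to pred_y and Y_train):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Two well-separated synthetic groups stand in for Hypoxia / Normal
X, y_true = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# ARI is 1.0 for a perfect match (up to relabelling), ~0 for a random one
ari = adjusted_rand_score(y_true, pred)
print(ari)
```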
We apply the same to the TSNE transformed data:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(mcf7_train_T_S_tsne)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Two clusters already reduce inertia a lot:
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(mcf7_train_T_S_tsne)
plt.scatter(mcf7_train_T_S_tsne[:,0], mcf7_train_T_S_tsne[:,1], c=Y_train, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='black')
plt.show()
K-means on the t-SNE-transformed data worked even better than on PCA. We identify the centers of the two condition groups very well. Red and blue show the Hypoxia and Normal conditions (the ground truth of the dataset), and black marks the cluster centroids found, which match the truth.
We apply the same to the original data that is not standardized and not PCA transformed:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(mcf7_train_T)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
For the original data, there is a second elbow at x=3:
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(mcf7_train_T)
#ax = plt.figure(figsize=(9,9)).add_subplot(projection='3d')
plt.scatter(mcf7_train_T.to_numpy()[:,0], mcf7_train_T.to_numpy()[:,1], c=Y_train, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='black', s=50)
plt.show()
Three clusters do not work well. What about 2 clusters?
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(mcf7_train_T)
plt.scatter(mcf7_train_T.to_numpy()[:,0], mcf7_train_T.to_numpy()[:,1], c=Y_train, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='black', s=50)
plt.show()
We obtained poor results with the original dataset. It is better to standardize, reduce the dimensionality, and cluster the data projected onto the essential dimensions found.
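This standardize, reduce, cluster recipe can be packaged as a single sklearn Pipeline so the three steps are always applied in the right order. A sketch on arbitrary synthetic data (the step names and sizes are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 40))

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=3, random_state=101)),
    ('kmeans', KMeans(n_clusters=2, n_init=10, random_state=0)),
])

# fit_predict scales and projects X, then returns the cluster labels
labels = pipe.fit_predict(X)
print(labels.shape)  # (120,)
```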
Finally we apply the Kmeans to the test set that is PCA transformed with three components to explore:
mcf7_test_T_S_PCA3 = pca_mcf7_final.transform(mcf7_test_T_S)
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(mcf7_train_T_S_PCA3)
plt.scatter(mcf7_test_T_S_PCA3[:,0], mcf7_test_T_S_PCA3[:,1])
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red')
plt.show()
It looks very much like the figure of k-means applied to the PCA-transformed training set.
----- Memory Cleaning Start -----
import gc

alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
print(alldfs)
for df_elem in alldfs:
    if df_elem not in ['mcf7_train_T', 'mcf7_train_T_S', 'Y_train']:
        exec('del ' + df_elem)
gc.collect()
['_147', '_149', '_150', '_153', '_157', 'mcf7_test', 'mcf7_test_T', 'mcf7_train', 'mcf7_train_T']
27194
----- Memory Cleaning End -----
I am loading the train set with 3000 features:
HCC1806_train = pd.read_csv("SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_train)) # 3000 expressions of different genes, 182 cells
print("First column: ", HCC1806_train.iloc[ : , 0])
Dataframe dimensions: (3000, 182)
First column: "DDIT4" 0
"ANGPTL4" 48
"CALML5" 0
"KRT14" 321
"CCNB1" 298
...
"LINC02693" 29
"OR8B9P" 0
"NEAT1" 29
"ZDHHC23" 0
"ODAD2" 0
Name: "output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam", Length: 3000, dtype: int64
HCC1806_train.describe()
| "output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G1_Hypoxia_S102_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G2_Hypoxia_S2_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G3_Hypoxia_S7_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G4_Hypoxia_S107_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G7_Normoxia_S118_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G8_Normoxia_S19_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1G9_Normoxia_S121_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1H1_Hypoxia_S103_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate1H2_Hypoxia_S3_Aligned.sortedByCoord.out.bam" | ... | "output.STAR.PCRPlate4G12_Normoxia_S243_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G1_Hypoxia_S193_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G2_Hypoxia_S198_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G6_Hypoxia_S232_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4G7_Normoxia_S204_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H10_Normoxia_S210_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H11_Normoxia_S214_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H2_Hypoxia_S199_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H7_Normoxia_S205_Aligned.sortedByCoord.out.bam" | "output.STAR.PCRPlate4H9_Normoxia_S236_Aligned.sortedByCoord.out.bam" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.00000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | ... | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 | 3000.000000 |
| mean | 149.353000 | 182.303000 | 178.945667 | 168.183333 | 184.400000 | 168.87200 | 223.504333 | 156.678667 | 178.393333 | 183.545667 | ... | 153.646333 | 175.213000 | 188.859333 | 196.469333 | 144.678000 | 146.055000 | 162.045667 | 182.989000 | 155.877667 | 130.704000 |
| std | 1052.553246 | 871.447201 | 965.087457 | 918.214156 | 1267.698452 | 1607.97906 | 2453.417156 | 1312.696362 | 998.494048 | 1011.438386 | ... | 899.533313 | 966.105199 | 1246.163219 | 1128.887322 | 659.116172 | 649.928442 | 783.003904 | 1081.586336 | 887.607124 | 716.861167 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 5.250000 | 33.000000 | 0.000000 | 30.000000 | 0.000000 | 0.00000 | 1.000000 | 23.000000 | 21.000000 | 0.000000 | ... | 25.000000 | 16.000000 | 7.000000 | 19.000000 | 18.000000 | 26.000000 | 28.000000 | 12.000000 | 3.500000 | 17.000000 |
| max | 39148.000000 | 22572.000000 | 21430.000000 | 24033.000000 | 32768.000000 | 59650.00000 | 109881.000000 | 61737.000000 | 32269.000000 | 26064.000000 | ... | 23857.000000 | 26918.000000 | 38157.000000 | 37232.000000 | 11028.000000 | 12319.000000 | 17681.000000 | 29201.000000 | 18969.000000 | 23424.000000 |
8 rows × 182 columns
We want single cells to be our observations, and the gene expressions to be the features. So we transpose the dataset:
HCC1806_train_T = HCC1806_train.T
HCC1806_train_T.describe()
| "DDIT4" | "ANGPTL4" | "CALML5" | "KRT14" | "CCNB1" | "IGFBP3" | "AKR1C2" | "KRT6A" | "NDRG1" | "KRT4" | ... | "MST1R" | "ZYG11A" | "NRG1" | "RBMS3" | "VCPIP1" | "LINC02693" | "OR8B9P" | "NEAT1" | "ZDHHC23" | "ODAD2" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.00000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 | ... | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 | 182.000000 |
| mean | 4038.736264 | 1227.164835 | 398.175824 | 921.307692 | 867.087912 | 1271.28022 | 1407.873626 | 1729.543956 | 457.895604 | 396.637363 | ... | 123.302198 | 4.291209 | 138.868132 | 7.192308 | 56.675824 | 57.098901 | 0.153846 | 102.076923 | 12.236264 | 2.175824 |
| std | 4165.241080 | 1949.430648 | 886.985647 | 2387.091444 | 1268.359981 | 2810.10746 | 2582.400094 | 3709.250440 | 627.115448 | 1541.330938 | ... | 119.297816 | 11.790896 | 147.116950 | 21.864761 | 123.906663 | 86.519829 | 1.060835 | 140.866555 | 28.353222 | 8.754596 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 |
| 25% | 295.000000 | 4.250000 | 0.000000 | 13.750000 | 61.250000 | 83.00000 | 140.250000 | 313.500000 | 4.000000 | 0.000000 | ... | 27.500000 | 0.000000 | 36.250000 | 0.000000 | 0.000000 | 0.250000 | 0.000000 | 25.000000 | 0.000000 | 0.000000 |
| 50% | 2729.500000 | 290.000000 | 0.000000 | 320.500000 | 299.500000 | 270.50000 | 484.000000 | 737.000000 | 143.500000 | 0.000000 | ... | 98.500000 | 0.000000 | 97.000000 | 0.000000 | 36.000000 | 34.000000 | 0.000000 | 57.000000 | 0.000000 | 0.000000 |
| 75% | 6933.500000 | 1857.750000 | 465.750000 | 1029.000000 | 988.500000 | 1163.50000 | 1472.000000 | 1774.000000 | 688.750000 | 0.000000 | ... | 185.750000 | 0.000000 | 191.000000 | 6.000000 | 74.250000 | 68.000000 | 0.000000 | 117.500000 | 11.500000 | 0.000000 |
| max | 16700.000000 | 14032.000000 | 5482.000000 | 28680.000000 | 6914.000000 | 21554.00000 | 20195.000000 | 41946.000000 | 3356.000000 | 9902.000000 | ... | 751.000000 | 68.000000 | 852.000000 | 246.000000 | 1545.000000 | 615.000000 | 11.000000 | 966.000000 | 222.000000 | 67.000000 |
8 rows × 3000 columns
HCC1806_train_T.head(1)
| "DDIT4" | "ANGPTL4" | "CALML5" | "KRT14" | "CCNB1" | "IGFBP3" | "AKR1C2" | "KRT6A" | "NDRG1" | "KRT4" | ... | "MST1R" | "ZYG11A" | "NRG1" | "RBMS3" | "VCPIP1" | "LINC02693" | "OR8B9P" | "NEAT1" | "ZDHHC23" | "ODAD2" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "output.STAR.PCRPlate1G12_Normoxia_S32_Aligned.sortedByCoord.out.bam" | 0 | 48 | 0 | 321 | 298 | 82 | 6250 | 634 | 0 | 0 | ... | 78 | 10 | 136 | 0 | 0 | 29 | 0 | 29 | 0 | 0 |
1 rows × 3000 columns
random_variable_indices = [randint(0, (HCC1806_train_T.shape[1]-1)) for i in range(0,10)]
print(random_variable_indices)
for i in random_variable_indices:
    sns.displot(
        HCC1806_train_T,
        x=HCC1806_train_T.columns.tolist()[i],
        kind="kde"
    )
[2587, 2312, 725, 1118, 171, 1520, 1961, 1175, 1423, 1276]
I am loading the test set and transposing it:
HCC1806_test = pd.read_csv("SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_test)) # 3000 expressions of different genes, 45 cells
print("First column: ", HCC1806_test.iloc[ : , 0])
Dataframe dimensions: (3000, 45)
First column: "DDIT4" 0
"ANGPTL4" 0
"CALML5" 0
"KRT14" 169
"CCNB1" 233
...
"LINC02693" 48
"OR8B9P" 0
"NEAT1" 118
"ZDHHC23" 6
"ODAD2" 0
Name: "1", Length: 3000, dtype: int64
HCC1806_test.head()
| "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" | "10" | ... | "36" | "37" | "38" | "39" | "40" | "41" | "42" | "43" | "44" | "45" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| "DDIT4" | 0 | 2475 | 9088 | 6909 | 13655 | 7684 | 13038 | 640 | 3 | 542 | ... | 4968 | 301 | 2264 | 122 | 1845 | 3654 | 437 | 10496 | 1506 | 1035 |
| "ANGPTL4" | 0 | 0 | 2143 | 3086 | 2196 | 1619 | 1917 | 1 | 1 | 81 | ... | 1634 | 153 | 127 | 47 | 643 | 1807 | 2 | 0 | 374 | 0 |
| "CALML5" | 0 | 0 | 0 | 0 | 0 | 6 | 0 | 0 | 0 | 0 | ... | 797 | 0 | 0 | 0 | 0 | 0 | 1526 | 1 | 635 | 0 |
| "KRT14" | 169 | 0 | 0 | 909 | 1 | 74 | 108 | 1 | 1794 | 99 | ... | 53 | 745 | 0 | 1917 | 1 | 0 | 3143 | 643 | 349 | 998 |
| "CCNB1" | 233 | 3537 | 124 | 78 | 1 | 0 | 991 | 128 | 256 | 180 | ... | 81 | 1647 | 732 | 380 | 3309 | 973 | 82 | 0 | 26 | 5161 |
5 rows × 45 columns
HCC1806_test_T = HCC1806_test.T
The first unsupervised learning technique we will use to find hidden patterns in the data is PCA.
PCA is a dimensionality reduction technique that projects the data into a new vector space along the directions of maximum variance.
We standardize each feature first, as is usually done before PCA:
standardizer = StandardScaler()
HCC1806_train_T_S = standardizer.fit_transform(HCC1806_train_T)
HCC1806_train_T_S.shape
(182, 3000)
pd.DataFrame(HCC1806_train_T_S, columns=HCC1806_train_T.columns.tolist()).describe().round(2)
| "DDIT4" | "ANGPTL4" | "CALML5" | "KRT14" | "CCNB1" | "IGFBP3" | "AKR1C2" | "KRT6A" | "NDRG1" | "KRT4" | ... | "MST1R" | "ZYG11A" | "NRG1" | "RBMS3" | "VCPIP1" | "LINC02693" | "OR8B9P" | "NEAT1" | "ZDHHC23" | "ODAD2" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | ... | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 | 182.00 |
| mean | -0.00 | -0.00 | -0.00 | 0.00 | 0.00 | 0.00 | -0.00 | -0.00 | 0.00 | 0.00 | ... | 0.00 | -0.00 | 0.00 | -0.00 | -0.00 | -0.00 | 0.00 | -0.00 | 0.00 | 0.00 |
| std | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | ... | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
| min | -0.97 | -0.63 | -0.45 | -0.39 | -0.69 | -0.45 | -0.55 | -0.47 | -0.73 | -0.26 | ... | -1.04 | -0.36 | -0.95 | -0.33 | -0.46 | -0.66 | -0.15 | -0.72 | -0.43 | -0.25 |
| 25% | -0.90 | -0.63 | -0.45 | -0.38 | -0.64 | -0.42 | -0.49 | -0.38 | -0.73 | -0.26 | ... | -0.81 | -0.36 | -0.70 | -0.33 | -0.46 | -0.66 | -0.15 | -0.55 | -0.43 | -0.25 |
| 50% | -0.32 | -0.48 | -0.45 | -0.25 | -0.45 | -0.36 | -0.36 | -0.27 | -0.50 | -0.26 | ... | -0.21 | -0.36 | -0.29 | -0.33 | -0.17 | -0.27 | -0.15 | -0.32 | -0.43 | -0.25 |
| 75% | 0.70 | 0.32 | 0.08 | 0.05 | 0.10 | -0.04 | 0.02 | 0.01 | 0.37 | -0.26 | ... | 0.52 | -0.36 | 0.36 | -0.05 | 0.14 | 0.13 | -0.15 | 0.11 | -0.03 | -0.25 |
| max | 3.05 | 6.59 | 5.75 | 11.66 | 4.78 | 7.24 | 7.30 | 10.87 | 4.63 | 6.18 | ... | 5.28 | 5.42 | 4.86 | 10.95 | 12.04 | 6.47 | 10.25 | 6.15 | 7.42 | 7.43 |
8 rows × 3000 columns
Now that our data set has zero mean and unit variance, we can apply the PCA transformation.
We do not specify the number of components: according to the documentation, when the number of samples is smaller than the number of features, sklearn's PCA returns as many components as there are samples.
We will do analyses later on to understand the relationship between the information in the dataset and the optimal number of components to use:
pca_HCC1806 = PCA(random_state=101)
HCC1806_train_T_S_PCA = pca_HCC1806.fit_transform(HCC1806_train_T_S)
explained_variance_HCC1806 = pca_HCC1806.explained_variance_ratio_
cumulative_sum_variance2 = np.cumsum(explained_variance_HCC1806)
We plot the cumulative variance of the first ten principal components:
plt.plot(np.arange(1,11,1), cumulative_sum_variance2[0:10])
plt.ylabel('cumulative variance')
Text(0, 0.5, 'cumulative variance')
The first ten principal components explain 24% of the variance of the dataset. The first two principal components alone reflect less than 7% of the total variance; including the third component adds only about 3% more information.
plt.plot(np.arange(1,101,1), cumulative_sum_variance2[0:100])
plt.ylabel('cumulative variance')
Text(0, 0.5, 'cumulative variance')
We need about 100 components to account for 80% of the variance in the data. Given that we start with 3000 features, working with only a thirtieth of the original dimensions is a substantial compression.
For this exercise, we can plot hypoxia and normoxia cells using the first two principal components:
Y_train2 = HCC1806_train_T.reset_index()['index'].apply(lambda x_str: 1 if 'Hypo' in x_str else 0)
plt.figure()
plot = plt.scatter(HCC1806_train_T_S_PCA[:,0], HCC1806_train_T_S_PCA[:,1], c=Y_train2, cmap="bwr_r")
plt.legend(handles=plot.legend_elements()[0], labels=["Hypoxia","Normal"])
plt.show()
PCA with two components does not separate the data well enough; it performs worse here than on the other cell line.
We apply the PCA transformer to the test set:
HCC1806_test_T_S = standardizer.transform(HCC1806_test_T)
HCC1806_test_T_S_PCA = pca_HCC1806.transform(HCC1806_test_T_S)
plt.figure()
plt.scatter(HCC1806_test_T_S_PCA[:,0], HCC1806_test_T_S_PCA[:,1])
plt.show()
There are no well defined clusters observed in the test data.
Let's try again, reducing the training data to three components:
pca_HCC1806 = PCA(n_components=3)
principalComponents = pca_HCC1806.fit_transform(HCC1806_train_T_S)
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(principalComponents[:,0],principalComponents[:,1], principalComponents[:,2], c =Y_train2, cmap="bwr_r", s = 30)
plt.show()
Also with three components, we are still not able to identify two distinct clusters. The third dimension did not add much information.
As a second dimensionality reduction method, we will try the t-SNE algorithm. t-SNE works well on data with non-linear structure. It converts pairwise distances into probability distributions of similarities and minimizes the divergence between the distribution in the original space and the one in the low-dimensional embedding, keeping similar observations close and dissimilar ones apart.
tsne_HCC1806 = TSNE(random_state=101)
HCC1806_train_T_S_tsne = tsne_HCC1806.fit_transform(HCC1806_train_T_S)
HCC1806_train_T_S_tsne.shape
(182, 2)
t-SNE returns only 2 components by default.
plt.figure()
plot = plt.scatter(HCC1806_train_T_S_tsne[:,0], HCC1806_train_T_S_tsne[:,1], c=Y_train2, cmap="bwr_r")
plt.legend(handles=plot.legend_elements()[0], labels=["Hypoxia","Normal"])
plt.show()
The conditions are not separable from each other in the t-SNE solution. We try with three components:
tsne_HCC1806 = TSNE(n_components=3)
principalComponents = tsne_HCC1806.fit_transform(HCC1806_train_T_S)
principalComponents.shape
(182, 3)
fig = plt.figure(figsize=(12, 12))
ax = fig.add_subplot(projection='3d')
ax.scatter(principalComponents[:,0],principalComponents[:,1], principalComponents[:,2], c=Y_train2, cmap="bwr_r", s=30)
plt.show()
The 3D solution does not work well either.
Comparing the two methods, t-SNE works less well than PCA on the HCC1806 data.
We can use our PCA- or t-SNE-transformed dataset to find clusters with the k-means algorithm. We use the transformed data to help the algorithm; the untransformed dataset could be used as well, and we tried that too.
The metric that k-means minimizes is called inertia: the sum of squared distances of samples to their closest cluster centroid.
We choose 3 components in PCA to transform the data (there is little difference between 2 and 3, as we explained above):
pca_HCC1806_final = PCA(n_components=3, random_state=101)
HCC1806_train_T_S_PCA3 = pca_HCC1806_final.fit_transform(HCC1806_train_T_S)
pca_HCC1806_final2 = PCA(n_components=2, random_state=101)
HCC1806_train_T_S_PCA2 = pca_HCC1806_final2.fit_transform(HCC1806_train_T_S)
We calculate inertia for k-means with 1 to 10 clusters:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(HCC1806_train_T_S_PCA3)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
There is a break point when the number of clusters equals 4.
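The elbow can also be located programmatically: the bend is where the inertia curve has its largest second difference. A sketch on illustrative WCSS values (not the ones computed above):

```python
import numpy as np

# Hypothetical WCSS values for k = 1..6 (illustrative numbers, not the notebook's).
wcss = np.array([1000.0, 850.0, 700.0, 300.0, 290.0, 285.0])

# The elbow is where the curve bends most sharply: the largest second difference.
second_diff = np.diff(wcss, n=2)           # defined for k = 2..5
elbow_k = int(np.argmax(second_diff)) + 2  # shift the index back to a k value
print(elbow_k)  # -> 4 for these values
```

This is only a heuristic; it should agree with, not replace, a visual check of the plot.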
We train k-means to obtain 4 clusters, then visualize the cluster centers in 2D against different pairs of components to see whether any of them shows a good separation:
kmeans = KMeans(n_clusters=4, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(HCC1806_train_T_S_PCA3)
#ax = plt.figure(figsize=(9,9)).add_subplot(projection='3d')
plt.scatter(HCC1806_train_T_S_PCA3[:,0], HCC1806_train_T_S_PCA3[:,1], c=Y_train2, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=60, c='black')
plt.xlabel('First component')
plt.ylabel('Second component')
plt.show()
plt.scatter(HCC1806_train_T_S_PCA3[:,0], HCC1806_train_T_S_PCA3[:,2], c=Y_train2, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 2], s=60, c='black')
plt.xlabel('First component')
plt.ylabel('Third component')
plt.show()
plt.scatter(HCC1806_train_T_S_PCA3[:,1], HCC1806_train_T_S_PCA3[:,2], c=Y_train2, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s=60, c='black')
plt.xlabel('Second component')
plt.ylabel('Third component')
plt.show()
On the standardized, PCA-transformed HCC1806 dataset, k-means does not give a good result.
We apply the same to the TSNE transformed data:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(HCC1806_train_T_S_tsne)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
Two clusters reduce inertia considerably, as there is an initial bend in the curve. We try the solution with 2 clusters:
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(HCC1806_train_T_S_tsne)
plt.scatter(HCC1806_train_T_S_tsne[:,0], HCC1806_train_T_S_tsne[:,1], c=Y_train2, cmap="bwr_r")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=30, c='black')
plt.show()
This solution is not able to cluster the data into separate oxygen-condition groups. Red and blue show the Hypoxia and Normal conditions (the ground truth of the dataset), while black marks the cluster centroids found, which do NOT match the true groups.
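Instead of judging the match visually, the agreement between cluster assignments and the true conditions can be quantified, for example with the adjusted Rand index. A small sketch on hypothetical labels (not our data):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, confusion_matrix

# Hypothetical labels: true oxygen condition vs. k-means cluster assignments.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])

# 1.0 = perfect agreement, ~0.0 = random clustering; label permutations are ignored.
ari = adjusted_rand_score(y_true, y_pred)
print(ari)
print(confusion_matrix(y_true, y_pred))
```

The adjusted Rand index is convenient here because it does not care which cluster gets which numeric label.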
We apply the same to the standardized original dataset:
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(HCC1806_train_T_S)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
We try a solution with 2 clusters and visualize several pairs of features:
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(HCC1806_train_T_S)
#ax = plt.figure(figsize=(9,9)).add_subplot(projection='3d')
plt.scatter(HCC1806_train_T_S[:,0], HCC1806_train_T_S[:,10], c=Y_train2, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 10], s=40, c='black')
plt.xlabel(HCC1806_train_T.columns.tolist()[0])
plt.ylabel(HCC1806_train_T.columns.tolist()[10])
plt.show()
The 2-cluster solution does not work well. We try with 3 dimensions:
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(HCC1806_train_T_S)
ax = plt.figure(figsize=(9,9)).add_subplot(projection='3d')
ax.scatter(HCC1806_train_T_S[:,0], HCC1806_train_T_S[:,1], HCC1806_train_T_S[:,2], c=Y_train2, cmap="bwr_r", s=30)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], s=40, c='black')
plt.show()
As these do not work well, we try the original, non-standardized data and plot the first two features. The cluster centers end up too close to each other:
kmeans = KMeans(n_clusters=2, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(HCC1806_train_T)
plt.scatter(HCC1806_train_T.to_numpy()[:,0], HCC1806_train_T.to_numpy()[:,1], c=Y_train2, cmap="bwr_r", s=30)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='black', s=40)
plt.show()
kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300, n_init=10, random_state=0)
pred_y = kmeans.fit_predict(HCC1806_train_T)
ax = plt.figure(figsize=(9,9)).add_subplot(projection='3d')
ax.scatter(HCC1806_train_T.to_numpy()[:,0], HCC1806_train_T.to_numpy()[:,1], HCC1806_train_T.to_numpy()[:,2], c=Y_train2, cmap="bwr_r", s=30)
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], kmeans.cluster_centers_[:, 2], c='black', s=40)
plt.show()
We obtained bad results with the original dataset.
It is better to standardize and reduce the dimensionality first, and then cluster the dataset projected onto the essential dimensions found.
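This standardize, then reduce, then cluster recipe can be bundled into a single scikit-learn Pipeline so the same steps are always applied in the same order. A minimal sketch on synthetic data (not our expression matrices):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(101)
X = rng.normal(size=(100, 50))  # synthetic stand-in: 100 cells, 50 genes

pipe = Pipeline([
    ("scale", StandardScaler()),      # standardize each feature
    ("reduce", PCA(n_components=3)),  # keep the main directions of variance
    ("cluster", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
labels = pipe.fit_predict(X)
print(labels.shape)
```

Besides being shorter, a pipeline avoids accidentally fitting the scaler or PCA on test data.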
Unlike for the MCF7 dataset, for the HCC1806 dataset PCA separated the data into two groups matching the hypoxia conditions better than k-means and t-SNE did.
--------- Memory Cleaning Start ---------
import gc
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
print(alldfs)
for df_elem in alldfs:
    if df_elem not in ['mcf7_train_T', 'mcf7_train_T_S', 'Y_train', 'HCC1806_train_T', 'HCC1806_train_T_S', 'Y_train2']:
        exec('del ' + df_elem)
gc.collect()
['HCC1806_test', 'HCC1806_test_T', 'HCC1806_train', 'HCC1806_train_T', '_181', '_183', '_184', '_187', '_191', 'mcf7_train_T']
40231
--------- Memory Cleaning End ---------
We will answer the questions asked in the report template:
The first task is to develop a classifier for each cell type using one or more ML approaches. For each cell type, we build 3 different classifiers: Logistic Regression, Random Forest, and Perceptron.
We will use the PCA-reduced data. As explained above, 100 components were enough to cover 70%-80% of the total variance. Using all 3000 features may not be optimal, as we only have 250 observations in one dataset and 182 in the other; more features require more data.
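As a side note, scikit-learn can pick the number of components for a target variance fraction automatically when `n_components` is a float in (0, 1). A sketch on synthetic correlated data (a stand-in for the expression matrix):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(101)
# Synthetic correlated data: 200 samples driven by 10 latent factors.
latent = rng.normal(size=(200, 10))
X = latent @ rng.normal(size=(10, 500)) + 0.1 * rng.normal(size=(200, 500))

# A float n_components keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.80, random_state=101).fit(X)
print(pca.n_components_, pca.explained_variance_ratio_.sum())
```

This would let the 70%-80% target drive the component count directly instead of fixing it at 100.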
Could you apply some feature selection? How will you apply it? We apply feature selection by keeping the first 100 principal components.
print(HCC1806_train_T.shape)
print(Y_train2.shape)
print('Hypoxia condition percentage in HCC1806:', Y_train2[Y_train2 == 1].sum()/len(Y_train2))
(182, 3000)
(182,)
Hypoxia condition percentage in HCC1806: 0.532967032967033
print(mcf7_train_T.shape)
print(Y_train.shape)
print('Hypoxia condition percentage in MCF7:', Y_train[Y_train == 1].sum()/len(Y_train))
(250, 3000)
(250,)
Hypoxia condition percentage in MCF7: 0.496
pca_mcf7 = PCA(random_state=101, n_components=100)
mcf7_train_T_S_PCA100 = pca_mcf7.fit_transform(mcf7_train_T_S)
pca_hcc1806 = PCA(random_state=101, n_components=100)
hcc1806_train_T_S_PCA100 = pca_hcc1806.fit_transform(HCC1806_train_T_S)
print(hcc1806_train_T_S_PCA100.shape)
print(Y_train2.shape)
print(mcf7_train_T_S_PCA100.shape)
print(Y_train.shape)
(182, 100) (182,) (250, 100) (250,)
For both cell lines, we fit a Random Forest model, selecting the best one using the cross-validation scores of a grid search.
For each combination of values defined in the parameter dictionary, the best model is the one that performs best on average across 3 folds. We do not use more than 3 folds because we do not have many samples in the dataset.
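Wrapping the grid search inside `cross_val_score`, as done below, is a form of nested cross-validation: the inner folds pick the hyperparameters and the outer folds estimate how well the whole tuning procedure generalizes. A minimal sketch on toy data (not our expression matrices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Toy data standing in for the PCA-reduced expression matrix.
X, y = make_classification(n_samples=120, n_features=20, random_state=0)

# Inner loop: the grid search picks C on its own folds.
inner = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.1, 1.0]}, cv=3)
# Outer loop: 3-fold estimate of the whole "tune then fit" procedure.
scores = cross_val_score(inner, X, y, cv=3, scoring="accuracy")
print(scores)
```

The outer scores are therefore a less optimistic estimate than the grid search's own `best_score_`.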
from sklearn.ensemble import RandomForestClassifier
model_RF=RandomForestClassifier(random_state=42)
dict_params_RF = {
    "n_estimators": [1, 2],
    "criterion": ("gini", "entropy"),
    "max_depth": [2, 3],
    "min_samples_split": [5],
    "max_features": ("sqrt", "log2"),
}
### RF model MCF7 ###
grid_RF_mcf7=GridSearchCV(model_RF, dict_params_RF)
grid_RF_mcf7.fit(mcf7_train_T_S_PCA100, Y_train)
cv_scores_RF_mcf7 = cross_val_score(grid_RF_mcf7, mcf7_train_T_S_PCA100, Y_train, cv=3, scoring='accuracy')
### RF model HCC1806 ###
grid_RF_hcc1806=GridSearchCV(model_RF, dict_params_RF)
grid_RF_hcc1806.fit(hcc1806_train_T_S_PCA100, Y_train2)
cv_scores_RF_hcc1806 = cross_val_score(grid_RF_hcc1806, hcc1806_train_T_S_PCA100, Y_train2, cv=3, scoring='accuracy')
print('Results Random Forest Model MCF7 cell')
print('*'*30)
print(cv_scores_RF_mcf7)
print('Average Performance:', cv_scores_RF_mcf7.mean())
print(cv_scores_RF_mcf7.std())
print('-'*30)
print('Results Random Forest Model HCC1806 cell')
print('*'*30)
print(cv_scores_RF_hcc1806)
print('Average Performance:', cv_scores_RF_hcc1806.mean())
print(cv_scores_RF_hcc1806.std())
Results Random Forest Model MCF7 cell
******************************
[0.89285714 0.81927711 0.56626506]
Average Performance: 0.7594664371772805
0.1398775283453446
------------------------------
Results Random Forest Model HCC1806 cell
******************************
[0.81967213 0.73770492 0.76666667]
Average Performance: 0.7746812386156648
0.033939466005366375
The Random Forest model has slightly better accuracy for HCC1806 (77%) than for MCF7 (76%). The cross-validation scores also have a lower standard deviation for HCC1806, i.e. a more stable performance across folds for this cell type.
print('Best Random Forest Model obtained for MCF7 cell')
print('*'*30)
print(grid_RF_mcf7.best_params_)
print('-'*30)
print('Best Random Forest Model obtained for HCC1806 cell')
print('*'*30)
print(grid_RF_hcc1806.best_params_)
Best Random Forest Model obtained for MCF7 cell
******************************
{'criterion': 'gini', 'max_depth': 3, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 2}
------------------------------
Best Random Forest Model obtained for HCC1806 cell
******************************
{'criterion': 'entropy', 'max_depth': 3, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 2}
The best models are almost identical; they differ only in the split criterion (gini for MCF7 vs. entropy for HCC1806). We fit the best model for each cell line:
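Incidentally, `GridSearchCV` with the default `refit=True` already retrains the best configuration on the whole training set, so the fitted model can be pulled straight from `best_estimator_` instead of being re-instantiated by hand. A minimal sketch on toy data (`make_classification` is a stand-in for our PCA-reduced matrices):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=100, n_features=10, random_state=42)

grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    {"n_estimators": [1, 2], "max_depth": [2, 3]})
grid.fit(X, y)

# With refit=True (the default), best_estimator_ is already trained on all of X, y.
best_model = grid.best_estimator_
print(type(best_model).__name__, grid.best_params_)
```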
model_RF_best_mcf7 = RandomForestClassifier(
    random_state=42, **grid_RF_mcf7.best_params_
).fit(mcf7_train_T_S_PCA100, Y_train)
model_RF_best_hcc1806 = RandomForestClassifier(
    random_state=42, **grid_RF_hcc1806.best_params_
).fit(hcc1806_train_T_S_PCA100, Y_train2)
For both cell lines, we fit a Logistic Regression model, selecting the best one using the cross-validation scores of a grid search. For each combination of values defined in the parameter dictionary, the best model is the one that performs best on average across 3 folds. We do not use more than 3 folds because we do not have many samples in the dataset.
from sklearn.linear_model import LogisticRegression
model_LR=LogisticRegression(random_state=42)
dict_params_LR = {
"C" : [0.5, 1, 2, 5, 10],
"solver" :("newton-cg", "lbfgs", "liblinear"),
"multi_class": ("auto", "ovr"),
}
### LR model MCF7 ###
grid_LR_mcf7 = GridSearchCV(model_LR, dict_params_LR)
grid_LR_mcf7.fit(mcf7_train_T_S_PCA100, Y_train)
cv_scores_LR_mcf7 = cross_val_score(grid_LR_mcf7, mcf7_train_T_S_PCA100, Y_train, cv=3, scoring='accuracy')
### LR model HCC1806 ###
grid_LR_hcc1806=GridSearchCV(model_LR, dict_params_LR)
grid_LR_hcc1806.fit(hcc1806_train_T_S_PCA100, Y_train2)
cv_scores_LR_hcc1806 = cross_val_score(grid_LR_hcc1806, hcc1806_train_T_S_PCA100, Y_train2, cv=3, scoring='accuracy')
print('Results Logistic Regression Model MCF7 cell')
print('*'*30)
print(cv_scores_LR_mcf7)
print('Average Performance:', cv_scores_LR_mcf7.mean())
print(cv_scores_LR_mcf7.std())
print('-'*30)
print('Results Logistic Regression Model HCC1806 cell')
print('*'*30)
print(cv_scores_LR_hcc1806)
print('Average Performance:', cv_scores_LR_hcc1806.mean())
print(cv_scores_LR_hcc1806.std())
Results Logistic Regression Model MCF7 cell
******************************
[1. 1. 1.]
Average Performance: 1.0
0.0
------------------------------
Results Logistic Regression Model HCC1806 cell
******************************
[0.95081967 0.95081967 0.96666667]
Average Performance: 0.9561020036429873
0.007470344864994521
The Logistic Regression model predicts the hypoxia condition better for the MCF7 cell line than for HCC1806 (100% accuracy vs 95%, with 0 vs 0.01 standard deviation across folds); the perfect cross-validation score suggests the two conditions are linearly separable in the 100-component PCA space. Logistic Regression worked better than the Random Forest models for both cell lines.
print('Best Logistic Regression Model obtained for MCF7 cell')
print('*'*30)
print(grid_LR_mcf7.best_params_)
print('-'*30)
print('Best Logistic Regression Model obtained for HCC1806 cell')
print('*'*30)
print(grid_LR_hcc1806.best_params_)
Best Logistic Regression Model obtained for MCF7 cell
******************************
{'C': 0.5, 'multi_class': 'auto', 'solver': 'newton-cg'}
------------------------------
Best Logistic Regression Model obtained for HCC1806 cell
******************************
{'C': 0.5, 'multi_class': 'auto', 'solver': 'newton-cg'}
Again, the best models obtained have the same hyperparameters for both cell types. We fit one model per cell line:
model_LR_best_mcf7 =LogisticRegression(
random_state=42,
C=grid_LR_mcf7.best_params_['C'],
multi_class=grid_LR_mcf7.best_params_['multi_class'],
solver=grid_LR_mcf7.best_params_['solver'],
)
model_LR_best_hcc1806 =LogisticRegression(
random_state=42,
C=grid_LR_hcc1806.best_params_['C'],
multi_class=grid_LR_hcc1806.best_params_['multi_class'],
solver=grid_LR_hcc1806.best_params_['solver'],
)
model_LR_best_mcf7 = model_LR_best_mcf7.fit(mcf7_train_T_S_PCA100, Y_train)
model_LR_best_hcc1806 = model_LR_best_hcc1806.fit(hcc1806_train_T_S_PCA100, Y_train2)
For both cell lines, we fit a Perceptron model, selecting the best one using the cross-validation scores of a grid search. For each combination of values defined in the parameter dictionary, the best model is the one that performs best on average across 3 folds. We do not use more than 3 folds because we do not have many samples in the dataset.
from sklearn.linear_model import Perceptron
model_P=Perceptron(random_state=42)
dict_params_P = {
    "penalty": ("l2", "l1", "elasticnet", None),  # None, not the string "None"
    "alpha": [0.0001, 0.001, 0.01],
    "eta0": [0.5, 1, 2, 5],
}
### P model MCF7 ###
grid_P_mcf7 = GridSearchCV(model_P, dict_params_P)
grid_P_mcf7.fit(mcf7_train_T_S_PCA100, Y_train)
cv_scores_P_mcf7 = cross_val_score(grid_P_mcf7, mcf7_train_T_S_PCA100, Y_train, cv=3, scoring='accuracy')
###P model HCC1806 ###
grid_P_hcc1806=GridSearchCV(model_P, dict_params_P)
grid_P_hcc1806.fit(hcc1806_train_T_S_PCA100, Y_train2)
cv_scores_P_hcc1806 = cross_val_score(grid_P_hcc1806, hcc1806_train_T_S_PCA100, Y_train2, cv=3, scoring='accuracy')
print('Results Perceptron Model MCF7 cell')
print('*'*30)
print(cv_scores_P_mcf7)
print('Average Performance:', cv_scores_P_mcf7.mean())
print(cv_scores_P_mcf7.std())
print('-'*30)
print('Results Perceptron Model HCC1806 cell')
print('*'*30)
print(cv_scores_P_hcc1806)
print('Average Performance:', cv_scores_P_hcc1806.mean())
print(cv_scores_P_hcc1806.std())
Results Perceptron Model MCF7 cell
******************************
[0.98809524 0.97590361 0.98795181]
Average Performance: 0.9839835532606617
0.005713679573121927
------------------------------
Results Perceptron Model HCC1806 cell
******************************
[0.8852459 0.93442623 0.93333333]
Average Performance: 0.9176684881602913
0.022930571922559168
The Perceptron model predicts the hypoxia condition much better for the MCF7 cell line than for HCC1806 (98% accuracy vs 92%, with 0.005 vs 0.02 standard deviation across folds).
The Perceptron worked better than the Random Forest models but worse than Logistic Regression for both the MCF7 and HCC1806 cell lines.
print('Best Perceptron Model obtained for MCF7 cell')
print('*'*30)
print(grid_P_mcf7.best_params_)
print('-'*30)
print('Best Perceptron Model obtained for HCC1806 cell')
print('*'*30)
print(grid_P_hcc1806.best_params_)
Best Perceptron Model obtained for MCF7 cell
******************************
{'alpha': 0.0001, 'eta0': 0.5, 'penalty': 'l2'}
------------------------------
Best Perceptron Model obtained for HCC1806 cell
******************************
{'alpha': 0.01, 'eta0': 0.5, 'penalty': 'l1'}
This time the best hyperparameters differ between the two cell types (they agree only on eta0). We fit a separate model for each:
model_P_best_m =Perceptron(
random_state=42,
alpha=grid_P_mcf7.best_params_['alpha'],
eta0=grid_P_mcf7.best_params_['eta0'],
penalty=grid_P_mcf7.best_params_['penalty'],
)
model_P_best_mcf7 = model_P_best_m.fit(mcf7_train_T_S_PCA100, Y_train)
model_P_best_h =Perceptron(
random_state=42,
alpha=grid_P_hcc1806.best_params_['alpha'],
eta0=grid_P_hcc1806.best_params_['eta0'],
penalty=grid_P_hcc1806.best_params_['penalty'],
)
model_P_best_hcc1806 = model_P_best_h.fit(hcc1806_train_T_S_PCA100, Y_train2)
If we consider the cross validation scores, the best model for both cell types is Logistic Regression:
perf_cv_summary =[{
'Cell': 'hcc1806',
'Perceptron_mean_perf':cv_scores_P_hcc1806.mean(),
'LR_mean_perf':cv_scores_LR_hcc1806.mean(),
'RF_mean_perf':cv_scores_RF_hcc1806.mean(),
'Perceptron_std_perf':cv_scores_P_hcc1806.std(),
'LR_std_perf':cv_scores_LR_hcc1806.std(),
'RF_std_perf':cv_scores_RF_hcc1806.std()
},{
'Cell': 'mcf7',
'Perceptron_mean_perf':cv_scores_P_mcf7.mean(),
'LR_mean_perf':cv_scores_LR_mcf7.mean(),
'RF_mean_perf':cv_scores_RF_mcf7.mean(),
'Perceptron_std_perf':cv_scores_P_mcf7.std(),
'LR_std_perf':cv_scores_LR_mcf7.std(),
'RF_std_perf':cv_scores_RF_mcf7.std()
}
]
pd.DataFrame.from_dict(perf_cv_summary)
#pd.DataFrame.from_records(perf_cv_summary)
|   | Cell | Perceptron_mean_perf | LR_mean_perf | RF_mean_perf | Perceptron_std_perf | LR_std_perf | RF_std_perf |
|---|---|---|---|---|---|---|---|
| 0 | hcc1806 | 0.917668 | 0.956102 | 0.774681 | 0.022931 | 0.00747 | 0.033939 |
| 1 | mcf7 | 0.983984 | 1.000000 | 0.759466 | 0.005714 | 0.00000 | 0.139878 |
--------- Memory Cleaning ------- START
alldfs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
print(alldfs) # df1, df2
['HCC1806_train_T', '_', '_233', 'mcf7_train_T']
--------- Memory Cleaning ------- END
Could you test the classifier as a predictor on a cell type where it was not developed? Does it predict well? As asked in the report template, we predict the oxygen condition in each dataset using the model built for the other one:
print('Score on mcf7 using the model for hcc1806:', model_LR_best_hcc1806.score(mcf7_train_T_S_PCA100, Y_train))
Score on mcf7 using the model for hcc1806: 0.764
print('Score on hcc1806 using the model for mcf7:', model_LR_best_mcf7.score(hcc1806_train_T_S_PCA100, Y_train2))
Score on hcc1806 using the model for mcf7: 0.5934065934065934
Both transferred models beat random guessing (accuracy above 0.5), but the HCC1806 model transfers to MCF7 data much better (0.76) than the MCF7 model transfers to HCC1806 (0.59).
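This kind of asymmetric degradation is a typical domain-shift effect: a classifier tuned to one cell line's expression distribution can lose accuracy when that distribution moves. A toy sketch of the phenomenon (synthetic data, not our matrices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Source "cell line": two classes separated along every feature.
X_src = np.vstack([rng.normal(-2, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
y = np.array([0] * 100 + [1] * 100)
# Target "cell line": same classes, but the whole cloud is shifted (domain shift).
X_tgt = X_src + 3.0

clf = LogisticRegression().fit(X_src, y)
print("in-domain accuracy:", clf.score(X_src, y))
print("shifted-domain accuracy:", clf.score(X_tgt, y))
```

Here the decision boundary learned on the source data cuts through the shifted target cloud, so accuracy drops even though the class structure is unchanged.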
mcf7_test = pd.read_csv("SmartSeq/MCF7_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_test)) # 3000 expressions of different genes, 250 cells
print("First column: ", mcf7_test.iloc[ : , 0])
HCC1806_test = pd.read_csv("SmartSeq/HCC1806_SmartS_Filtered_Normalised_3000_Data_test_anonim.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_test)) # 3000 expressions of different genes, 250 cells
print("First column: ", HCC1806_test.iloc[ : , 0])
Dataframe dimensions: (3000, 63)
First column: "CYP1B1" 492
"CYP1B1-AS1" 253
"CYP1A1" 0
"NDRG1" 1157
"DDIT4" 6805
...
"GRIK5" 0
"SLC25A27" 0
"DENND5A" 285
"CDK5R1" 0
"FAM13A-AS1" 1
Name: "1", Length: 3000, dtype: int64
Dataframe dimensions: (3000, 45)
First column: "DDIT4" 0
"ANGPTL4" 0
"CALML5" 0
"KRT14" 169
"CCNB1" 233
...
"LINC02693" 48
"OR8B9P" 0
"NEAT1" 118
"ZDHHC23" 6
"ODAD2" 0
Name: "1", Length: 3000, dtype: int64
standardizer = StandardScaler()
standardizer.fit_transform(mcf7_train_T)
mcf7_test_T_S = standardizer.transform(mcf7_test.T)
mcf7_test_T_S_PCA100 = pca_mcf7.transform(mcf7_test_T_S)
standardizer = StandardScaler()
standardizer.fit_transform(HCC1806_train_T)
hcc1806_test_T_S = standardizer.transform(HCC1806_test.T)
hcc1806_test_T_S_PCA100 = pca_hcc1806.transform(hcc1806_test_T_S)
prediction_mcf7 = model_LR_best_mcf7.predict(mcf7_test_T_S_PCA100)
prediction_hcc1806 = model_LR_best_hcc1806.predict(hcc1806_test_T_S_PCA100)
mcf7_test_T = mcf7_test.T
mcf7_test_T['prediction'] = prediction_mcf7
hcc1806_test_T = HCC1806_test.T
hcc1806_test_T['prediction'] = prediction_hcc1806
We save our predictions to separate files as requested; the predictions are in the prediction column:
hcc1806_test_T[['prediction']].to_csv('hcc1806_test_T_SmartSeq.csv', sep ='\t')
mcf7_test_T[['prediction']].to_csv('mcf7_test_T_SmartSeq.csv', sep ='\t')
We build one model by concatenating the datasets of the two cell types, applying the same steps of standardization and PCA transformation to the entire set.
smartSeq_entireSet_T = np.concatenate([mcf7_train_T, HCC1806_train_T])
standardizer = StandardScaler()
smartSeq_entireSet_T_S = standardizer.fit_transform(smartSeq_entireSet_T)
pca_all = PCA(random_state=101, n_components=100)
all_PCA100 = pca_all.fit_transform(smartSeq_entireSet_T_S)
smartSeq_Y = np.concatenate([Y_train, Y_train2])
from sklearn.ensemble import RandomForestClassifier
model_RF=RandomForestClassifier(random_state=42)
dict_params_RF = {
"n_estimators" : [1,2],
"criterion" :("gini", "entropy"),
"max_depth": [2,3],
"min_samples_split" : [5,],
"max_features": ("sqrt", "log2")
}
### RF model ###
grid_RF_all=GridSearchCV(model_RF, dict_params_RF)
grid_RF_all.fit(all_PCA100, smartSeq_Y)
cv_scores_RF_all = cross_val_score(grid_RF_all, all_PCA100, smartSeq_Y, cv=3, scoring='accuracy')
print('Best RF Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(grid_RF_all.best_params_)
print('Results RF Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(cv_scores_RF_all)
print('Average Performance:', cv_scores_RF_all.mean())
print('Std of Performance:', cv_scores_RF_all.std())
Best RF Model obtained for both MCF7 and HCC1806
******************************
{'criterion': 'gini', 'max_depth': 3, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 2}
Results RF Model obtained for both MCF7 and HCC1806
******************************
[0.79861111 0.90277778 0.50694444]
Average Performance: 0.736111111111111
Std of Performance: 0.16753247335853916
The RF model trained on all cells together has an average performance of about 0.74. This is lower than the RF models of the individual cell types.
from sklearn.linear_model import LogisticRegression
model_LR=LogisticRegression(random_state=42)
dict_params_LR = {
"C" : [0.5, 1, 2, 5, 10],
"solver" :("newton-cg", "lbfgs", "liblinear"),
"multi_class": ("auto", "ovr"),
}
### LR model ###
grid_LR_all=GridSearchCV(model_LR, dict_params_LR)
grid_LR_all.fit(all_PCA100, smartSeq_Y)
cv_scores_LR_all = cross_val_score(grid_LR_all, all_PCA100, smartSeq_Y, cv=3, scoring='accuracy')
print('Best LR Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(grid_LR_all.best_params_)
print('Results LR Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(cv_scores_LR_all)
print('Average Performance:', cv_scores_LR_all.mean())
print('Std of Performance:', cv_scores_LR_all.std())
Best LR Model obtained for both MCF7 and HCC1806
******************************
{'C': 0.5, 'multi_class': 'auto', 'solver': 'newton-cg'}
Results LR Model obtained for both MCF7 and HCC1806
******************************
[1. 0.98611111 0.95138889]
Average Performance: 0.9791666666666666
Std of Performance: 0.02044389089427745
The LR model trained on all cells together has an average performance of about 0.98. This is one of the best results overall; only the MCF7-specific LR model (accuracy 1.0) performs better.
from sklearn.linear_model import Perceptron
model_P=Perceptron(random_state=42)
dict_params_P = {
    "penalty": ("l2", "l1", "elasticnet", None),  # None, not the string "None"
    "alpha": [0.0001, 0.001, 0.01],
    "eta0": [0.5, 1, 2, 5],
}
### P model ###
grid_P_all=GridSearchCV(model_P, dict_params_P)
grid_P_all.fit(all_PCA100, smartSeq_Y)
cv_scores_P_all = cross_val_score(grid_P_all, all_PCA100, smartSeq_Y, cv=3, scoring='accuracy')
print('Best P Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(grid_P_all.best_params_)
print('Results P Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(cv_scores_P_all)
print('Average Performance:', cv_scores_P_all.mean())
print('Std of Performance:', cv_scores_P_all.std())
Best P Model obtained for both MCF7 and HCC1806
******************************
{'alpha': 0.0001, 'eta0': 5, 'penalty': 'l2'}
Results P Model obtained for both MCF7 and HCC1806
******************************
[0.96527778 0.98611111 0.74305556]
Average Performance: 0.8981481481481483
Std of Performance: 0.10999633676404373
The Perceptron model trained on all cells together has an average performance of about 0.90. This is lower than the Perceptron models of the individual cell types.
We also build one combined model for the DropSeq data by concatenating the datasets of the two cell types, applying the same standardization and PCA transformation steps to the entire set.
mcf7_train_DropSeq = pd.read_csv("DropSeq/MCF7_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_train_DropSeq)) # 3000 genes, one column per cell
print("First column: ", mcf7_train_DropSeq.iloc[ : , 0])
HCC1806_train_DropSeq = pd.read_csv("DropSeq/HCC1806_Filtered_Normalised_3000_Data_train.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_train_DropSeq)) # 3000 genes, one column per cell
print("First column: ", HCC1806_train_DropSeq.iloc[ : , 0])
Dataframe dimensions: (3000, 21626)
First column: "MALAT1" 1
"MT-RNR2" 0
"NEAT1" 0
"H1-5" 0
"TFF1" 4
..
"BRWD1-AS2" 0
"RPS19BP1" 0
"AUNIP" 0
"TNK2" 0
"SUDS3" 0
Name: "AAAAACCTATCG_Normoxia", Length: 3000, dtype: int64
Dataframe dimensions: (3000, 14682)
First column: "H1-5" 2
"MALAT1" 3
"MT-RNR2" 0
"ARVCF" 0
"BCYRN1" 0
..
"SCCPDH" 0
"NTAN1" 0
"CLIP2" 0
"DUSP23" 0
"ZNF682" 0
Name: "AAAAAACCCGGC_Normoxia", Length: 3000, dtype: int64
mcf7_train_T_DropSeq = mcf7_train_DropSeq.T
HCC1806_train_T_DropSeq = HCC1806_train_DropSeq.T
DropSeq_entireSet_T = np.concatenate([mcf7_train_T_DropSeq, HCC1806_train_T_DropSeq])
standardizer = StandardScaler()
DropSeq_entireSet_T_S = standardizer.fit_transform(DropSeq_entireSet_T)
pca_all_DropSeq = PCA(random_state=101, n_components=100)
all_PCA100_DropSeq = pca_all_DropSeq.fit_transform(DropSeq_entireSet_T_S)
Y_train_DropSeq = mcf7_train_T_DropSeq.reset_index()['index'].apply(lambda x_str: 1 if 'Hypo' in x_str else 0)
Y_train2_DropSeq = HCC1806_train_T_DropSeq.reset_index()['index'].apply(lambda x_str: 1 if 'Hypo' in x_str else 0)
DropSeq_Y = np.concatenate([Y_train_DropSeq, Y_train2_DropSeq])
from sklearn.ensemble import RandomForestClassifier
model_RF=RandomForestClassifier(random_state=42)
dict_params_RF = {
"n_estimators" : [1,2],
"criterion" :("gini", "entropy"),
"max_depth": [2,3],
"min_samples_split" : [5,],
"max_features": ("sqrt", "log2")
}
### RF model ###
grid_RF_all_DropSeq=GridSearchCV(model_RF, dict_params_RF)
grid_RF_all_DropSeq.fit(all_PCA100_DropSeq, DropSeq_Y)
cv_scores_RF_all_DropSeq = cross_val_score(grid_RF_all_DropSeq, all_PCA100_DropSeq, DropSeq_Y, cv=3, scoring='accuracy')
print('Best RF Model obtained for both MCF7 and HCC1806 for DropSeq ')
print('*'*30)
print(grid_RF_all_DropSeq.best_params_)
print('Results RF Model obtained for both MCF7 and HCC1806 for DropSeq')
print('*'*30)
print(cv_scores_RF_all_DropSeq)
print('Average Performance:', cv_scores_RF_all_DropSeq.mean())
print('Std of Performance:', cv_scores_RF_all_DropSeq.std())
Best RF Model obtained for both MCF7 and HCC1806 for DropSeq
******************************
{'criterion': 'entropy', 'max_depth': 2, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 2}
Results RF Model obtained for both MCF7 and HCC1806 for DropSeq
******************************
[0.55515162 0.72387011 0.46529499]
Average Performance: 0.5814389075709209
Std of Performance: 0.10718687658233535
from sklearn.linear_model import LogisticRegression
model_LR=LogisticRegression(random_state=42, max_iter=1000)
dict_params_LR = {
"C" : [0.5, 1],
"solver" :("newton-cg", "lbfgs", "liblinear"),
"multi_class": ("auto", "ovr"),
}
### LR model ###
grid_LR_all_DropSeq=GridSearchCV(model_LR, dict_params_LR)
grid_LR_all_DropSeq.fit(all_PCA100_DropSeq, DropSeq_Y)
cv_scores_LR_all_DropSeq = cross_val_score(grid_LR_all_DropSeq, all_PCA100_DropSeq, DropSeq_Y, cv=3, scoring='accuracy')
print('Best LR Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(grid_LR_all_DropSeq.best_params_)
print('Results LR Model obtained for both MCF7 and HCC1806')
print('*'*30)
print(cv_scores_LR_all_DropSeq)
print('Average Performance:', cv_scores_LR_all_DropSeq.mean())
print('Std of Performance:', cv_scores_LR_all_DropSeq.std())
Best LR Model obtained for both MCF7 and HCC1806
******************************
{'C': 0.5, 'multi_class': 'auto', 'solver': 'newton-cg'}
Results LR Model obtained for both MCF7 and HCC1806
******************************
[0.96058828 0.9586053 0.5245414 ]
Average Performance: 0.8145783288275116
Std of Performance: 0.20508867827399468
from sklearn.linear_model import Perceptron
model_P=Perceptron(random_state=42)
dict_params_P = {
    "penalty": ("l2", "l1", "elasticnet", None),  # None, not the string "None"
    "alpha": [0.0001, 0.001, 0.01],
    "eta0": [0.5, 1, 2, 5],
}
### P model ###
grid_P_all_DropSeq=GridSearchCV(model_P, dict_params_P)
grid_P_all_DropSeq.fit(all_PCA100_DropSeq, DropSeq_Y)
cv_scores_P_all_DropSeq = cross_val_score(grid_P_all_DropSeq, all_PCA100_DropSeq, DropSeq_Y, cv=3, scoring='accuracy')
print('Best Perceptron Model obtained for both MCF7 and HCC1806 for DropSeq ')
print('*'*30)
print(grid_P_all_DropSeq.best_params_)
print('Results Perceptron Model obtained for both MCF7 and HCC1806 for DropSeq')
print('*'*30)
print(cv_scores_P_all_DropSeq)
print('Average Performance:', cv_scores_P_all_DropSeq.mean())
print('Std of Performance:', cv_scores_P_all_DropSeq.std())
Best Perceptron Model obtained for both MCF7 and HCC1806 for DropSeq
******************************
{'alpha': 0.001, 'eta0': 0.5, 'penalty': 'l1'}
Results Perceptron Model obtained for both MCF7 and HCC1806 for DropSeq
******************************
[0.92249855 0.93728828 0.5221451 ]
Average Performance: 0.7939773098983718
Std of Performance: 0.19230920714020885
The ranking of the models stays the same: Logistic Regression is the most efficient, then the Perceptron, with Random Forest the least efficient. Comparing the models' efficiency between SmartSeq and DropSeq, we obtained better results with SmartSeq.
mcf7_test_DropSeq = pd.read_csv("DropSeq/MCF7_Filtered_Normalised_3000_Data_test_anonim.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(mcf7_test_DropSeq)) # 3000 genes, one column per cell
print("First column: ", mcf7_test_DropSeq.iloc[ : , 0])
HCC1806_test_DropSeq = pd.read_csv("DropSeq/HCC1806_Filtered_Normalised_3000_Data_test_anonim.txt",delimiter="\ ",engine='python',index_col=0)
print("Dataframe dimensions:", np.shape(HCC1806_test_DropSeq)) # 3000 genes, one column per cell
print("First column: ", HCC1806_test_DropSeq.iloc[ : , 0])
Dataframe dimensions: (3000, 5406)
First column: "CYP1B1" 492
"CYP1B1-AS1" 253
"CYP1A1" 0
"NDRG1" 1157
"DDIT4" 6805
...
"GRIK5" 0
"SLC25A27" 0
"DENND5A" 285
"CDK5R1" 0
"FAM13A-AS1" 1
Name: "1", Length: 3000, dtype: int64
Dataframe dimensions: (3000, 3671)
First column: "DDIT4" 0
"ANGPTL4" 0
"CALML5" 0
"KRT14" 169
"CCNB1" 233
...
"LINC02693" 48
"OR8B9P" 0
"NEAT1" 118
"ZDHHC23" 6
"ODAD2" 0
Name: "1", Length: 3000, dtype: int64
standardizer = StandardScaler()
mcf7_train_T_S_DropSeq = standardizer.fit_transform(mcf7_train_T_DropSeq)
standardizer = StandardScaler()
hcc1806_train_T_S_DropSeq = standardizer.fit_transform(HCC1806_train_T_DropSeq)
pca_mcf7_DropSeq = PCA(random_state=101, n_components=100)
mcf7_train_T_S_PCA100_DropSeq = pca_mcf7_DropSeq.fit_transform(mcf7_train_T_S_DropSeq)
pca_hcc1806_DropSeq = PCA(random_state=101, n_components=100)
hcc1806_train_T_S_PCA100_DropSeq = pca_hcc1806_DropSeq.fit_transform(hcc1806_train_T_S_DropSeq)
standardizer = StandardScaler()
standardizer.fit(mcf7_train_T_DropSeq)  # fit the scaler on the training data only
mcf7_test_T_S_DropSeq = standardizer.transform(mcf7_test_DropSeq.T)
mcf7_test_T_S_PCA100_DropSeq = pca_mcf7_DropSeq.transform(mcf7_test_T_S_DropSeq)
standardizer = StandardScaler()
standardizer.fit(HCC1806_train_T_DropSeq)
hcc1806_test_T_S_DropSeq = standardizer.transform(HCC1806_test_DropSeq.T)
hcc1806_test_T_S_PCA100_DropSeq = pca_hcc1806_DropSeq.transform(hcc1806_test_T_S_DropSeq)
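The standardize-then-PCA steps above can also be bundled into a single scikit-learn `Pipeline`, so the scaler and the PCA are fit once on the training data and reused on any test set without manual refitting. A minimal sketch on synthetic data (the array names are illustrative, not the notebook's variables):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(101)
X_train = rng.normal(size=(200, 50))  # stand-in for a cells-by-genes training matrix
X_test = rng.normal(size=(40, 50))

# Fit scaler and PCA on the training data once, then reuse both on the test set
prep = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10, random_state=101)),
])
X_train_p = prep.fit_transform(X_train)
X_test_p = prep.transform(X_test)  # no refitting: same means and components as training
print(X_train_p.shape, X_test_p.shape)
```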
from sklearn.linear_model import LogisticRegression
model_LR=LogisticRegression(random_state=42)
dict_params_LR = {
    "C": [0.5, 1, 2, 5, 10],
    "solver": ("newton-cg", "lbfgs", "liblinear"),
    "multi_class": ("auto", "ovr"),
}
### LR model MCF7 ###
grid_LR_mcf7_DropSeq = GridSearchCV(model_LR, dict_params_LR)
grid_LR_mcf7_DropSeq.fit(mcf7_train_T_S_PCA100_DropSeq, Y_train_DropSeq)
cv_scores_LR_mcf7_DropSeq = cross_val_score(grid_LR_mcf7_DropSeq, mcf7_train_T_S_PCA100_DropSeq, Y_train_DropSeq, cv=3, scoring='accuracy')
### LR model HCC1806 ###
grid_LR_hcc1806_DropSeq=GridSearchCV(model_LR, dict_params_LR)
grid_LR_hcc1806_DropSeq.fit(hcc1806_train_T_S_PCA100_DropSeq, Y_train2_DropSeq)
cv_scores_LR_hcc1806_DropSeq = cross_val_score(grid_LR_hcc1806_DropSeq, hcc1806_train_T_S_PCA100_DropSeq, Y_train2_DropSeq, cv=3, scoring='accuracy')
print('Best Logistic Regression Model_DropSeq obtained for MCF7 cell')
print('*'*30)
print(grid_LR_mcf7_DropSeq.best_params_)
print('-'*30)
print('Best Logistic Regression Model_DropSeq obtained for HCC1806 cell')
print('*'*30)
print(grid_LR_hcc1806_DropSeq.best_params_)
Best Logistic Regression Model_DropSeq obtained for MCF7 cell
******************************
{'C': 0.5, 'multi_class': 'auto', 'solver': 'newton-cg'}
------------------------------
Best Logistic Regression Model_DropSeq obtained for HCC1806 cell
******************************
{'C': 0.5, 'multi_class': 'auto', 'solver': 'liblinear'}
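As an aside, `GridSearchCV` (with its default `refit=True`) already keeps the winning configuration refitted on the full training data, so the manual re-instantiation from `best_params_` in the next cell could also be replaced by `grid.best_estimator_`. A small sketch on synthetic data (names illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 10))
y = (X[:, 0] > 0).astype(int)  # toy binary labels

grid = GridSearchCV(LogisticRegression(random_state=42), {"C": [0.5, 1, 2]}, cv=3)
grid.fit(X, y)

# GridSearchCV refits the best configuration on the full data by default,
# so the tuned model is available directly as best_estimator_
best = grid.best_estimator_
print(best.C == grid.best_params_["C"])  # True
```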
model_LR_best_mcf7_DropSeq =LogisticRegression(
random_state=42,
C=grid_LR_mcf7_DropSeq.best_params_['C'],
multi_class=grid_LR_mcf7_DropSeq.best_params_['multi_class'],
solver=grid_LR_mcf7_DropSeq.best_params_['solver'],
)
model_LR_best_hcc1806_DropSeq =LogisticRegression(
random_state=42,
C=grid_LR_hcc1806_DropSeq.best_params_['C'],
multi_class=grid_LR_hcc1806_DropSeq.best_params_['multi_class'],
solver=grid_LR_hcc1806_DropSeq.best_params_['solver'],
)
model_LR_best_mcf7_DropSeq = model_LR_best_mcf7_DropSeq.fit(mcf7_train_T_S_PCA100_DropSeq, Y_train_DropSeq)
model_LR_best_hcc1806_DropSeq = model_LR_best_hcc1806_DropSeq.fit(hcc1806_train_T_S_PCA100_DropSeq, Y_train2_DropSeq)
prediction_mcf7_DropSeq = model_LR_best_mcf7_DropSeq.predict(mcf7_test_T_S_PCA100_DropSeq)
prediction_hcc1806_DropSeq = model_LR_best_hcc1806_DropSeq.predict(hcc1806_test_T_S_PCA100_DropSeq)
mcf7_test_T_DropSeq = mcf7_test_DropSeq.T
mcf7_test_T_DropSeq['prediction'] = prediction_mcf7_DropSeq
hcc1806_test_T_DropSeq = HCC1806_test_DropSeq.T
hcc1806_test_T_DropSeq['prediction'] = prediction_hcc1806_DropSeq
We save our predictions into separate files as requested. The predictions are under the prediction column:
hcc1806_test_T_DropSeq[['prediction']].to_csv('hcc1806_test_T_DropSeq.csv', sep ='\t')
mcf7_test_T_DropSeq[['prediction']].to_csv('mcf7_test_T_DropSeq.csv', sep ='\t')
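For reference, the save-and-reload round trip can be sketched on a hypothetical miniature frame (the cell names, gene names, file name, and the "Hypoxia"/"Normoxia" labels here are illustrative, not the notebook's actual values):

```python
import pandas as pd

# Hypothetical miniature of the test matrix: cells x genes, plus a prediction column
df = pd.DataFrame({"GENE1": [3, 0], "GENE2": [1, 7]}, index=["cell_1", "cell_2"])
df["prediction"] = ["Hypoxia", "Normoxia"]

# Write only the prediction column, tab-separated, then read it back
df[["prediction"]].to_csv("predictions_demo.csv", sep="\t")
back = pd.read_csv("predictions_demo.csv", sep="\t", index_col=0)
print(back["prediction"].tolist())  # ['Hypoxia', 'Normoxia']
```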
drop_seq_table_hcc1806 = pd.read_csv("HCC1806_smart_table_pr.csv", delimiter=",",engine="python")
drop_seq_table_hcc1806.head()
|   | 0  | 1  |
|---|----|----|
| 0 | 25 | 0  |
| 1 | 1  | 19 |
drop_seq_table_mcf7 = pd.read_csv("MCF7_smart_table_pr.csv", delimiter=",",engine="python")
drop_seq_table_mcf7.head()
|   | 0  | 1  |
|---|----|----|
| 0 | 32 | 0  |
| 1 | 0  | 31 |
For HCC1806: True Positives (TP) = 25, False Positives (FP) = 0, True Negatives (TN) = 19, False Negatives (FN) = 1

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (25 + 19) / (25 + 19 + 0 + 1) = 0.978 or 97.8%

Precision = TP / (TP + FP) = 25 / (25 + 0) = 1 or 100%

Recall (Sensitivity) = TP / (TP + FN) = 25 / (25 + 1) = 0.9615 or 96.15%

For MCF7: True Positives (TP) = 32, False Positives (FP) = 0, True Negatives (TN) = 31, False Negatives (FN) = 0

Accuracy = (TP + TN) / (TP + TN + FP + FN) = (32 + 31) / (32 + 31 + 0 + 0) = 1 or 100%

Precision = TP / (TP + FP) = 32 / (32 + 0) = 1 or 100%

Recall (Sensitivity) = TP / (TP + FN) = 32 / (32 + 0) = 1 or 100%
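These hand computations can be cross-checked with `sklearn.metrics` by rebuilding label vectors that match the HCC1806 confusion-matrix counts above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Rebuild label vectors matching the HCC1806 counts: TP=25, FP=0, TN=19, FN=1
y_true = np.array([1] * 26 + [0] * 19)        # 26 positive cells, 19 negative cells
y_pred = np.array([1] * 25 + [0] + [0] * 19)  # one positive cell missed (the FN)

print(round(accuracy_score(y_true, y_pred), 3))   # 0.978
print(precision_score(y_true, y_pred))            # 1.0
print(round(recall_score(y_true, y_pred), 4))     # 0.9615
```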
In both cases, accuracy, precision, and recall indicate excellent performance, showing that the model is both accurate and reliable on the HCC1806 and MCF7 cell lines.